https://www.kaggle.com/datasets
Understanding the data: Import the data. Start by getting a basic understanding of the data you're working with, such as the size of the dataset, data types, and data structure.
Cleaning the data: Identify and handle any missing or corrupted data. This may involve imputing missing values, removing duplicates, or correcting errors.
Visualizing the data: Use visualizations such as histograms, box plots, scatter plots, and heat maps to gain insights into the distribution, variability, and relationships between variables. Make sure to identify the type of visualization techniques that you have used, i.e., univariate, bivariate or multivariate.
Analyzing relationships: Explore correlations and dependencies between variables using statistical measures such as correlation coefficients.
Identifying anomalies: Look for any unusual or unexpected patterns or outliers that may indicate errors or interesting phenomena (if any).
Summarizing the data: Based on the insights gained from the data exploration, formulate hypotheses about the relationships between variables. Finally, communicate the insights gained from the EDA using clear and effective narrative summaries.
#Student name: Andy Steve Lojuntin
#SID: EP0105960
# Link to dataset on Kaggle: https://www.kaggle.com/datasets/xaviernogueira/aqueduct-30-water-risk-atlas-basin-scores
# Declaring required packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pyarrow.parquet as pq
Reading parquet file requires us to use pd.read_parquet(filename, engine = enginename). Read more in the following link:
https://stackoverflow.com/questions/33813815/how-to-read-a-parquet-file-into-pandas-dataframe
# Reading the parquet file named "aqueduct30_basin_scores.parquet" within the same directory
basin_scores = pq.read_table('aqueduct30_basin_scores.parquet')
df_basin = basin_scores.to_pandas()
df_basin.to_csv('aqueduct30_basin_scores.csv')
df_basin.head(10)
| string_id | pfaf_id | gid_1 | aqid | bws_score | bwd_score | iav_score | sev_score | gtd_score | rfr_score | ... | w_awr_ong_rrr_score | w_awr_ong_tot_score | w_awr_smc_qan_score | w_awr_smc_qal_score | w_awr_smc_rrr_score | w_awr_smc_tot_score | w_awr_tex_qan_score | w_awr_tex_qal_score | w_awr_tex_rrr_score | w_awr_tex_tot_score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| aq30_id | |||||||||||||||||||||
| 0 | 111011-EGY.11_1-3365 | 111011 | EGY.11_1 | 3365 | 5.0 | 4.948243 | 4.141657 | 2.887187 | NaN | 4.180674 | ... | 1.856046 | 2.030983 | 4.581623 | 1.778723 | 2.165272 | 4.037438 | 4.431637 | 1.786984 | 2.165272 | 3.614603 |
| 1 | 111011-EGY.15_1-3365 | 111011 | EGY.15_1 | 3365 | 5.0 | 4.948243 | 4.141657 | 2.887187 | NaN | 4.180674 | ... | 1.856046 | 2.030983 | 4.581623 | 1.778723 | 2.165272 | 4.037438 | 4.431637 | 1.786984 | 2.165272 | 3.614603 |
| 2 | 111011-EGY.15_1-None | 111011 | EGY.15_1 | -9999 | 5.0 | 4.948243 | 4.141657 | 2.887187 | NaN | 4.180674 | ... | 1.856046 | 2.030983 | 4.581623 | 1.778723 | 2.165272 | 4.037438 | 4.431637 | 1.786984 | 2.165272 | 3.614603 |
| 3 | 111011-None-3365 | 111011 | -9999 | 3365 | 5.0 | 4.948243 | 4.141657 | 2.887187 | NaN | 4.180674 | ... | 1.133763 | 1.524180 | 4.581623 | 1.737420 | 1.133763 | 4.138271 | 4.431637 | 1.737420 | 1.133763 | 3.649648 |
| 4 | 111011-None-None | 111011 | -9999 | -9999 | 5.0 | 4.948243 | 4.141657 | 2.887187 | NaN | 4.180674 | ... | 1.133763 | 1.524180 | 4.581623 | 1.737420 | 1.133763 | 4.138271 | 4.431637 | 1.737420 | 1.133763 | 3.649648 |
| 5 | 111012-EGY.11_1-3365 | 111012 | EGY.11_1 | 3365 | 5.0 | 5.000000 | 4.645469 | 3.082393 | NaN | 0.000000 | ... | 2.217443 | 2.213490 | 4.422714 | 1.778723 | 2.410512 | 3.960088 | 4.277754 | 1.786984 | 2.410512 | 3.560073 |
| 6 | 111012-EGY.15_1-3365 | 111012 | EGY.15_1 | 3365 | 5.0 | 5.000000 | 4.645469 | 3.082393 | NaN | 0.000000 | ... | 2.217443 | 2.213490 | 4.422714 | 1.778723 | 2.410512 | 3.960088 | 4.277754 | 1.786984 | 2.410512 | 3.560073 |
| 7 | 111012-EGY.15_1-None | 111012 | EGY.15_1 | -9999 | 5.0 | 5.000000 | 4.645469 | 3.082393 | NaN | 0.000000 | ... | 2.217443 | 2.213490 | 4.422714 | 1.778723 | 2.410512 | 3.960088 | 4.277754 | 1.786984 | 2.410512 | 3.560073 |
| 8 | 111012-EGY.8_1-3365 | 111012 | EGY.8_1 | 3365 | 5.0 | 5.000000 | 4.645469 | 3.082393 | NaN | 0.000000 | ... | 2.217443 | 2.213490 | 4.422714 | 1.778723 | 2.410512 | 3.960088 | 4.277754 | 1.786984 | 2.410512 | 3.560073 |
| 9 | 111012-None-None | 111012 | -9999 | -9999 | 5.0 | 5.000000 | 4.645469 | 3.082393 | NaN | 0.000000 | ... | 1.778313 | 1.714568 | 4.422714 | 1.737420 | 1.778313 | 4.067284 | 4.277754 | 1.737420 | 1.778313 | 3.562805 |
10 rows × 57 columns
df_basin.shape
(68506, 57)
print(df_basin.columns)
Index(['string_id', 'pfaf_id', 'gid_1', 'aqid', 'bws_score', 'bwd_score',
'iav_score', 'sev_score', 'gtd_score', 'rfr_score', 'cfr_score',
'drr_score', 'ucw_score', 'cep_score', 'udw_score', 'usa_score',
'rri_score', 'w_awr_def_qan_score', 'w_awr_def_qal_score',
'w_awr_def_rrr_score', 'w_awr_def_tot_score', 'w_awr_agr_qan_score',
'w_awr_agr_qal_score', 'w_awr_agr_rrr_score', 'w_awr_agr_tot_score',
'w_awr_che_qan_score', 'w_awr_che_qal_score', 'w_awr_che_rrr_score',
'w_awr_che_tot_score', 'w_awr_con_qan_score', 'w_awr_con_qal_score',
'w_awr_con_rrr_score', 'w_awr_con_tot_score', 'w_awr_elp_qan_score',
'w_awr_elp_qal_score', 'w_awr_elp_rrr_score', 'w_awr_elp_tot_score',
'w_awr_fnb_qan_score', 'w_awr_fnb_qal_score', 'w_awr_fnb_rrr_score',
'w_awr_fnb_tot_score', 'w_awr_min_qan_score', 'w_awr_min_qal_score',
'w_awr_min_rrr_score', 'w_awr_min_tot_score', 'w_awr_ong_qan_score',
'w_awr_ong_qal_score', 'w_awr_ong_rrr_score', 'w_awr_ong_tot_score',
'w_awr_smc_qan_score', 'w_awr_smc_qal_score', 'w_awr_smc_rrr_score',
'w_awr_smc_tot_score', 'w_awr_tex_qan_score', 'w_awr_tex_qal_score',
'w_awr_tex_rrr_score', 'w_awr_tex_tot_score'],
dtype='object')
# Determining the amount of missing values
print(df_basin.isna())
string_id pfaf_id gid_1 aqid bws_score bwd_score iav_score \
aq30_id
0 False False False False False False False
1 False False False False False False False
2 False False False False False False False
3 False False False False False False False
4 False False False False False False False
... ... ... ... ... ... ... ...
68506 False False False False True True True
68507 False False False False True True True
68508 False False False False True True True
68509 False False False False True True True
68510 False False False False True True True
sev_score gtd_score rfr_score ... w_awr_ong_rrr_score \
aq30_id ...
0 False True False ... False
1 False True False ... False
2 False True False ... False
3 False True False ... False
4 False True False ... False
... ... ... ... ... ...
68506 True True True ... False
68507 True True True ... False
68508 True True True ... False
68509 True True True ... False
68510 True True True ... False
w_awr_ong_tot_score w_awr_smc_qan_score w_awr_smc_qal_score \
aq30_id
0 False False False
1 False False False
2 False False False
3 False False False
4 False False False
... ... ... ...
68506 False True False
68507 False True False
68508 False True False
68509 False True False
68510 False True False
w_awr_smc_rrr_score w_awr_smc_tot_score w_awr_tex_qan_score \
aq30_id
0 False False False
1 False False False
2 False False False
3 False False False
4 False False False
... ... ... ...
68506 False False True
68507 False False True
68508 False False True
68509 False False True
68510 False False True
w_awr_tex_qal_score w_awr_tex_rrr_score w_awr_tex_tot_score
aq30_id
0 False False False
1 False False False
2 False False False
3 False False False
4 False False False
... ... ... ...
68506 False False False
68507 False False False
68508 False False False
68509 False False False
68510 False False False
[68506 rows x 57 columns]
This dataset is classified by
It follows United Nations Office for Disaster Risk Reduction (UNDRR) risk element terminologies.
Complete documentation on column conventions is in the GitHub link: https://github.com/wri/aqueduct30_data_download/blob/master/metadata.md
In general, we must know that the following shortforms translate to;
For GIS data, one can also open using QGIS (Download here: https://www.qgis.org/en/site/) For WRI Water Risk 3.0 GIS dataset for time-series and geospatial analysis, the shape file can be downloaded from separate WRI website (Download Here: https://www.wri.org/data/aqueduct-global-maps-30-data)
One can open Malaysian Level 2 GIS (Highest specificity) using Python by running the lines below
# Installing geopandas and decartes packages using pip
!pip install geopandas
!pip install descartes
import geopandas as gpd
# shx_file = gpd.read_file('gadm41_MYS_2.shx')
shapefile = gpd.read_file('gadm41_MYS_2.shp')
print(shapefile)
shapefile.head()
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: geopandas in c:\users\asus\appdata\roaming\python\python39\site-packages (0.12.2)
Requirement already satisfied: shapely>=1.7 in c:\users\asus\appdata\roaming\python\python39\site-packages (from geopandas) (2.0.1)
Requirement already satisfied: packaging in c:\programdata\anaconda3\lib\site-packages (from geopandas) (21.3)
Requirement already satisfied: pyproj>=2.6.1.post1 in c:\users\asus\appdata\roaming\python\python39\site-packages (from geopandas) (3.5.0)
Requirement already satisfied: pandas>=1.0.0 in c:\programdata\anaconda3\lib\site-packages (from geopandas) (1.4.4)
Requirement already satisfied: fiona>=1.8 in c:\users\asus\appdata\roaming\python\python39\site-packages (from geopandas) (1.9.3)
Requirement already satisfied: click-plugins>=1.0 in c:\users\asus\appdata\roaming\python\python39\site-packages (from fiona>=1.8->geopandas) (1.1.1)
Requirement already satisfied: certifi in c:\programdata\anaconda3\lib\site-packages (from fiona>=1.8->geopandas) (2022.9.14)
Requirement already satisfied: attrs>=19.2.0 in c:\programdata\anaconda3\lib\site-packages (from fiona>=1.8->geopandas) (21.4.0)
Requirement already satisfied: munch>=2.3.2 in c:\users\asus\appdata\roaming\python\python39\site-packages (from fiona>=1.8->geopandas) (2.5.0)
Requirement already satisfied: cligj>=0.5 in c:\users\asus\appdata\roaming\python\python39\site-packages (from fiona>=1.8->geopandas) (0.7.2)
Requirement already satisfied: importlib-metadata in c:\programdata\anaconda3\lib\site-packages (from fiona>=1.8->geopandas) (4.11.3)
Requirement already satisfied: click~=8.0 in c:\programdata\anaconda3\lib\site-packages (from fiona>=1.8->geopandas) (8.0.4)
Requirement already satisfied: python-dateutil>=2.8.1 in c:\programdata\anaconda3\lib\site-packages (from pandas>=1.0.0->geopandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in c:\programdata\anaconda3\lib\site-packages (from pandas>=1.0.0->geopandas) (2022.1)
Requirement already satisfied: numpy>=1.18.5 in c:\programdata\anaconda3\lib\site-packages (from pandas>=1.0.0->geopandas) (1.21.5)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in c:\programdata\anaconda3\lib\site-packages (from packaging->geopandas) (3.0.9)
Requirement already satisfied: colorama in c:\programdata\anaconda3\lib\site-packages (from click~=8.0->fiona>=1.8->geopandas) (0.4.5)
Requirement already satisfied: six in c:\programdata\anaconda3\lib\site-packages (from munch>=2.3.2->fiona>=1.8->geopandas) (1.16.0)
Requirement already satisfied: zipp>=0.5 in c:\programdata\anaconda3\lib\site-packages (from importlib-metadata->fiona>=1.8->geopandas) (3.8.0)
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: descartes in c:\users\asus\appdata\roaming\python\python39\site-packages (1.1.0)
Requirement already satisfied: matplotlib in c:\programdata\anaconda3\lib\site-packages (from descartes) (3.5.2)
Requirement already satisfied: pillow>=6.2.0 in c:\programdata\anaconda3\lib\site-packages (from matplotlib->descartes) (9.2.0)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\programdata\anaconda3\lib\site-packages (from matplotlib->descartes) (1.4.2)
Requirement already satisfied: packaging>=20.0 in c:\programdata\anaconda3\lib\site-packages (from matplotlib->descartes) (21.3)
Requirement already satisfied: fonttools>=4.22.0 in c:\programdata\anaconda3\lib\site-packages (from matplotlib->descartes) (4.25.0)
Requirement already satisfied: cycler>=0.10 in c:\programdata\anaconda3\lib\site-packages (from matplotlib->descartes) (0.11.0)
Requirement already satisfied: python-dateutil>=2.7 in c:\programdata\anaconda3\lib\site-packages (from matplotlib->descartes) (2.8.2)
Requirement already satisfied: pyparsing>=2.2.1 in c:\programdata\anaconda3\lib\site-packages (from matplotlib->descartes) (3.0.9)
Requirement already satisfied: numpy>=1.17 in c:\programdata\anaconda3\lib\site-packages (from matplotlib->descartes) (1.21.5)
Requirement already satisfied: six>=1.5 in c:\programdata\anaconda3\lib\site-packages (from python-dateutil>=2.7->matplotlib->descartes) (1.16.0)
geometry
0 POLYGON ((102.78069 1.86530, 102.78100 1.86681...
1 POLYGON ((103.92537 1.66204, 103.92528 1.66188...
2 POLYGON ((103.24667 1.76524, 103.22960 1.77210...
3 MULTIPOLYGON (((104.23465 1.35317, 104.23479 1...
4 MULTIPOLYGON (((103.97639 1.42611, 103.97556 1...
.. ...
139 POLYGON ((102.65740 4.76026, 102.65970 4.76274...
140 POLYGON ((102.94190 4.09549, 102.93780 4.09744...
141 POLYGON ((103.05190 5.21919, 103.04830 5.22183...
142 MULTIPOLYGON (((103.38660 4.86609, 103.38220 4...
143 MULTIPOLYGON (((102.76980 5.64531, 102.77250 5...
[144 rows x 1 columns]
| geometry | |
|---|---|
| 0 | POLYGON ((102.78069 1.86530, 102.78100 1.86681... |
| 1 | POLYGON ((103.92537 1.66204, 103.92528 1.66188... |
| 2 | POLYGON ((103.24667 1.76524, 103.22960 1.77210... |
| 3 | MULTIPOLYGON (((104.23465 1.35317, 104.23479 1... |
| 4 | MULTIPOLYGON (((103.97639 1.42611, 103.97556 1... |
# Show the shp file
shapefile.plot(figsize=(16,8))
<AxesSubplot:>
Considering that WRI also had included predictions for multiple risk types, we can also represent in the same manner since it is also distributed in form of QGIS file. File extraction and display is as following.
# Importing required files
shapefile_path_predict = 'aqueduct_projections_20150309.shp'
shxfile_path_predict = 'aqueduct_projections_20150309.shx'
gdf = gpd.read_file(shapefile_path_predict, shx = shxfile_path_predict)
# # Plot map
# fig, ax = plt.subplots(figsize=(10, 10))
# gdf.plot(ax=ax, column='P_scarcity', cmap='Blues', legend=True)
# ax.set_title('Aqueduct Water Scarcity')
# plt.show()
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance) 3628 try: -> 3629 return self._engine.get_loc(casted_key) 3630 except KeyError as err: C:\ProgramData\Anaconda3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc() C:\ProgramData\Anaconda3\lib\site-packages\pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item() pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item() KeyError: 'P_scarcity' The above exception was the direct cause of the following exception: KeyError Traceback (most recent call last) ~\AppData\Local\Temp\ipykernel_45060\2643460080.py in <module> 1 # Plot map 2 fig, ax = plt.subplots(figsize=(10, 10)) ----> 3 gdf.plot(ax=ax, column='P_scarcity', cmap='Blues', legend=True) 4 ax.set_title('Aqueduct Water Scarcity') 5 plt.show() ~\AppData\Roaming\Python\Python39\site-packages\geopandas\plotting.py in __call__(self, *args, **kwargs) 966 kind = kwargs.pop("kind", "geo") 967 if kind == "geo": --> 968 return plot_dataframe(data, *args, **kwargs) 969 if kind in self._pandas_kinds: 970 # Access pandas plots ~\AppData\Roaming\Python\Python39\site-packages\geopandas\plotting.py in plot_dataframe(df, column, cmap, color, ax, cax, categorical, legend, scheme, k, vmin, vmax, markersize, figsize, legend_kwds, categories, classification_kwds, missing_kwds, aspect, **style_kwds) 726 values = values.reindex(df.index) 727 else: --> 728 values = df[column] 729 730 if pd.api.types.is_categorical_dtype(values.dtype): ~\AppData\Roaming\Python\Python39\site-packages\geopandas\geodataframe.py in __getitem__(self, key) 1413 return a GeoDataFrame. 1414 """ -> 1415 result = super().__getitem__(key) 1416 geo_col = self._geometry_column_name 1417 if isinstance(result, Series) and isinstance(result.dtype, GeometryDtype): C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key) 3503 if self.columns.nlevels > 1: 3504 return self._getitem_multilevel(key) -> 3505 indexer = self.columns.get_loc(key) 3506 if is_integer(indexer): 3507 indexer = [indexer] C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance) 3629 return self._engine.get_loc(casted_key) 3630 except KeyError as err: -> 3631 raise KeyError(key) from err 3632 except TypeError: 3633 # If we have a listlike key, _check_indexing_error will raise KeyError: 'P_scarcity'
Consider that we are only interested with Malaysian dataset of the water risk assessment, we can extract the relevant information as following
# Declaring partial identifiers since the string_id column is a composite of GIS
# identifier with other elements in the convention of "pfaf_id-gid_1-aqid"
MY_partial_identifier = 'MYS'
df_MY_basin = df_basin[df_basin['string_id'].str.contains(MY_partial_identifier)]
print(df_MY_basin)
df_MY_basin.to_csv('malaysia_basin.csv', index = False)
string_id pfaf_id gid_1 aqid bws_score bwd_score \
aq30_id
34100 444037-MYS.10_1-2171 444037 MYS.10_1 2171 0.0 0.280569
34101 444037-MYS.2_1-2171 444037 MYS.2_1 2171 0.0 0.280569
34102 444037-MYS.3_1-2171 444037 MYS.3_1 2171 0.0 0.280569
34103 444037-MYS.3_1-2217 444037 MYS.3_1 2217 0.0 0.280569
34104 444037-MYS.3_1-None 444037 MYS.3_1 -9999 0.0 0.280569
... ... ... ... ... ... ...
64956 None-MYS.6_1-None -9999 MYS.6_1 -9999 NaN NaN
64957 None-MYS.8_1-2283 -9999 MYS.8_1 2283 NaN NaN
64958 None-MYS.8_1-None -9999 MYS.8_1 -9999 NaN NaN
64959 None-MYS.9_1-2251 -9999 MYS.9_1 2251 NaN NaN
64960 None-MYS.9_1-None -9999 MYS.9_1 -9999 NaN NaN
iav_score sev_score gtd_score rfr_score ... w_awr_ong_rrr_score \
aq30_id ...
34100 1.289226 1.93994 NaN 3.655798 ... 1.550645
34101 1.289226 1.93994 NaN 3.655798 ... 1.550645
34102 1.289226 1.93994 NaN 3.655798 ... 1.550645
34103 1.289226 1.93994 NaN 3.655798 ... 1.550645
34104 1.289226 1.93994 NaN 3.655798 ... 1.550645
... ... ... ... ... ... ...
64956 NaN NaN NaN NaN ... 2.401490
64957 NaN NaN NaN NaN ... 2.401490
64958 NaN NaN NaN NaN ... 2.401490
64959 NaN NaN NaN NaN ... 2.401490
64960 NaN NaN NaN NaN ... 2.401490
w_awr_ong_tot_score w_awr_smc_qan_score w_awr_smc_qal_score \
aq30_id
34100 1.015042 2.227779 3.865583
34101 1.015042 2.227779 3.865583
34102 1.015042 2.227779 3.865583
34103 1.015042 2.227779 3.865583
34104 1.015042 2.227779 3.865583
... ... ... ...
64956 3.060939 NaN 5.000000
64957 3.060939 NaN 5.000000
64958 3.060939 NaN 5.000000
64959 3.060939 NaN 5.000000
64960 3.060939 NaN 5.000000
w_awr_smc_rrr_score w_awr_smc_tot_score w_awr_tex_qan_score \
aq30_id
34100 1.794886 2.762446 2.017945
34101 1.794886 2.762446 2.017945
34102 1.794886 2.762446 2.017945
34103 1.794886 2.762446 2.017945
34104 1.794886 2.762446 2.017945
... ... ... ...
64956 2.401490 4.563450 NaN
64957 2.401490 4.563450 NaN
64958 2.401490 4.563450 NaN
64959 2.401490 4.563450 NaN
64960 2.401490 4.563450 NaN
w_awr_tex_qal_score w_awr_tex_rrr_score w_awr_tex_tot_score
aq30_id
34100 4.337399 1.794886 2.415874
34101 4.337399 1.794886 2.415874
34102 4.337399 1.794886 2.415874
34103 4.337399 1.794886 2.415874
34104 4.337399 1.794886 2.415874
... ... ... ...
64956 5.000000 2.401490 4.345176
64957 5.000000 2.401490 4.345176
64958 5.000000 2.401490 4.345176
64959 5.000000 2.401490 4.345176
64960 5.000000 2.401490 4.345176
[243 rows x 57 columns]
Begin by looking up if there is any NaN entry in the new downscaled dataset
# Checking for NaN entries
print(df_MY_basin.isna())
string_id pfaf_id gid_1 aqid bws_score bwd_score iav_score \
aq30_id
34100 False False False False False False False
34101 False False False False False False False
34102 False False False False False False False
34103 False False False False False False False
34104 False False False False False False False
... ... ... ... ... ... ... ...
64956 False False False False True True True
64957 False False False False True True True
64958 False False False False True True True
64959 False False False False True True True
64960 False False False False True True True
sev_score gtd_score rfr_score ... w_awr_ong_rrr_score \
aq30_id ...
34100 False True False ... False
34101 False True False ... False
34102 False True False ... False
34103 False True False ... False
34104 False True False ... False
... ... ... ... ... ...
64956 True True True ... False
64957 True True True ... False
64958 True True True ... False
64959 True True True ... False
64960 True True True ... False
w_awr_ong_tot_score w_awr_smc_qan_score w_awr_smc_qal_score \
aq30_id
34100 False False False
34101 False False False
34102 False False False
34103 False False False
34104 False False False
... ... ... ...
64956 False True False
64957 False True False
64958 False True False
64959 False True False
64960 False True False
w_awr_smc_rrr_score w_awr_smc_tot_score w_awr_tex_qan_score \
aq30_id
34100 False False False
34101 False False False
34102 False False False
34103 False False False
34104 False False False
... ... ... ...
64956 False False True
64957 False False True
64958 False False True
64959 False False True
64960 False False True
w_awr_tex_qal_score w_awr_tex_rrr_score w_awr_tex_tot_score
aq30_id
34100 False False False
34101 False False False
34102 False False False
34103 False False False
34104 False False False
... ... ... ...
64956 False False False
64957 False False False
64958 False False False
64959 False False False
64960 False False False
[243 rows x 57 columns]
# For more specific NaN entries identification
nan_rows_MY = df_MY_basin[df_MY_basin.isna().any(axis=1)]
print(nan_rows_MY)
string_id pfaf_id gid_1 aqid bws_score bwd_score \
aq30_id
34100 444037-MYS.10_1-2171 444037 MYS.10_1 2171 0.0 0.280569
34101 444037-MYS.2_1-2171 444037 MYS.2_1 2171 0.0 0.280569
34102 444037-MYS.3_1-2171 444037 MYS.3_1 2171 0.0 0.280569
34103 444037-MYS.3_1-2217 444037 MYS.3_1 2217 0.0 0.280569
34104 444037-MYS.3_1-None 444037 MYS.3_1 -9999 0.0 0.280569
... ... ... ... ... ... ...
64956 None-MYS.6_1-None -9999 MYS.6_1 -9999 NaN NaN
64957 None-MYS.8_1-2283 -9999 MYS.8_1 2283 NaN NaN
64958 None-MYS.8_1-None -9999 MYS.8_1 -9999 NaN NaN
64959 None-MYS.9_1-2251 -9999 MYS.9_1 2251 NaN NaN
64960 None-MYS.9_1-None -9999 MYS.9_1 -9999 NaN NaN
iav_score sev_score gtd_score rfr_score ... w_awr_ong_rrr_score \
aq30_id ...
34100 1.289226 1.93994 NaN 3.655798 ... 1.550645
34101 1.289226 1.93994 NaN 3.655798 ... 1.550645
34102 1.289226 1.93994 NaN 3.655798 ... 1.550645
34103 1.289226 1.93994 NaN 3.655798 ... 1.550645
34104 1.289226 1.93994 NaN 3.655798 ... 1.550645
... ... ... ... ... ... ...
64956 NaN NaN NaN NaN ... 2.401490
64957 NaN NaN NaN NaN ... 2.401490
64958 NaN NaN NaN NaN ... 2.401490
64959 NaN NaN NaN NaN ... 2.401490
64960 NaN NaN NaN NaN ... 2.401490
w_awr_ong_tot_score w_awr_smc_qan_score w_awr_smc_qal_score \
aq30_id
34100 1.015042 2.227779 3.865583
34101 1.015042 2.227779 3.865583
34102 1.015042 2.227779 3.865583
34103 1.015042 2.227779 3.865583
34104 1.015042 2.227779 3.865583
... ... ... ...
64956 3.060939 NaN 5.000000
64957 3.060939 NaN 5.000000
64958 3.060939 NaN 5.000000
64959 3.060939 NaN 5.000000
64960 3.060939 NaN 5.000000
w_awr_smc_rrr_score w_awr_smc_tot_score w_awr_tex_qan_score \
aq30_id
34100 1.794886 2.762446 2.017945
34101 1.794886 2.762446 2.017945
34102 1.794886 2.762446 2.017945
34103 1.794886 2.762446 2.017945
34104 1.794886 2.762446 2.017945
... ... ... ...
64956 2.401490 4.563450 NaN
64957 2.401490 4.563450 NaN
64958 2.401490 4.563450 NaN
64959 2.401490 4.563450 NaN
64960 2.401490 4.563450 NaN
w_awr_tex_qal_score w_awr_tex_rrr_score w_awr_tex_tot_score
aq30_id
34100 4.337399 1.794886 2.415874
34101 4.337399 1.794886 2.415874
34102 4.337399 1.794886 2.415874
34103 4.337399 1.794886 2.415874
34104 4.337399 1.794886 2.415874
... ... ... ...
64956 5.000000 2.401490 4.345176
64957 5.000000 2.401490 4.345176
64958 5.000000 2.401490 4.345176
64959 5.000000 2.401490 4.345176
64960 5.000000 2.401490 4.345176
[243 rows x 57 columns]
Apparently, we have a huge number of entries with at least one entry having NaN for one of the 57 entry columns. Such circumstance is not ideal for blanket data cleansing, indicating the need for more modularised data handling. Let us begin with cloning the df_MY_basin to various smaller dataframes for non-destructive data handling of NaN entries.
# For Malaysian reservoirs
df_MY_bws = df_MY_basin[['string_id','bws_score']]
print(df_MY_bws)
df_MY_bwd = df_MY_basin[['string_id','bwd_score']]
print(df_MY_bwd)
df_MY_iav = df_MY_basin[['string_id','iav_score']]
print(df_MY_iav)
df_MY_sev = df_MY_basin[['string_id','sev_score']]
print(df_MY_sev)
df_MY_gtd = df_MY_basin[['string_id','gtd_score']]
print(df_MY_gtd)
df_MY_rfr = df_MY_basin[['string_id','rfr_score']]
print(df_MY_rfr)
df_MY_cfr = df_MY_basin[['string_id','cfr_score']]
print(df_MY_cfr)
df_MY_drr = df_MY_basin[['string_id','drr_score']]
print(df_MY_drr)
df_MY_ucw = df_MY_basin[['string_id','ucw_score']]
print(df_MY_ucw)
df_MY_cep = df_MY_basin[['string_id','cep_score']]
print(df_MY_cep)
df_MY_udw = df_MY_basin[['string_id','udw_score']]
print(df_MY_udw)
df_MY_usa = df_MY_basin[['string_id','usa_score']]
print(df_MY_usa)
df_MY_rri = df_MY_basin[['string_id','rri_score']]
print(df_MY_rri)
df_MY_w_awr_def_tot_score = df_MY_basin[['string_id','w_awr_def_tot_score']]
print(df_MY_w_awr_def_tot_score)
string_id bws_score
aq30_id
34100 444037-MYS.10_1-2171 0.0
34101 444037-MYS.2_1-2171 0.0
34102 444037-MYS.3_1-2171 0.0
34103 444037-MYS.3_1-2217 0.0
34104 444037-MYS.3_1-None 0.0
... ... ...
64956 None-MYS.6_1-None NaN
64957 None-MYS.8_1-2283 NaN
64958 None-MYS.8_1-None NaN
64959 None-MYS.9_1-2251 NaN
64960 None-MYS.9_1-None NaN
[243 rows x 2 columns]
string_id bwd_score
aq30_id
34100 444037-MYS.10_1-2171 0.280569
34101 444037-MYS.2_1-2171 0.280569
34102 444037-MYS.3_1-2171 0.280569
34103 444037-MYS.3_1-2217 0.280569
34104 444037-MYS.3_1-None 0.280569
... ... ...
64956 None-MYS.6_1-None NaN
64957 None-MYS.8_1-2283 NaN
64958 None-MYS.8_1-None NaN
64959 None-MYS.9_1-2251 NaN
64960 None-MYS.9_1-None NaN
[243 rows x 2 columns]
string_id iav_score
aq30_id
34100 444037-MYS.10_1-2171 1.289226
34101 444037-MYS.2_1-2171 1.289226
34102 444037-MYS.3_1-2171 1.289226
34103 444037-MYS.3_1-2217 1.289226
34104 444037-MYS.3_1-None 1.289226
... ... ...
64956 None-MYS.6_1-None NaN
64957 None-MYS.8_1-2283 NaN
64958 None-MYS.8_1-None NaN
64959 None-MYS.9_1-2251 NaN
64960 None-MYS.9_1-None NaN
[243 rows x 2 columns]
string_id sev_score
aq30_id
34100 444037-MYS.10_1-2171 1.93994
34101 444037-MYS.2_1-2171 1.93994
34102 444037-MYS.3_1-2171 1.93994
34103 444037-MYS.3_1-2217 1.93994
34104 444037-MYS.3_1-None 1.93994
... ... ...
64956 None-MYS.6_1-None NaN
64957 None-MYS.8_1-2283 NaN
64958 None-MYS.8_1-None NaN
64959 None-MYS.9_1-2251 NaN
64960 None-MYS.9_1-None NaN
[243 rows x 2 columns]
string_id gtd_score
aq30_id
34100 444037-MYS.10_1-2171 NaN
34101 444037-MYS.2_1-2171 NaN
34102 444037-MYS.3_1-2171 NaN
34103 444037-MYS.3_1-2217 NaN
34104 444037-MYS.3_1-None NaN
... ... ...
64956 None-MYS.6_1-None NaN
64957 None-MYS.8_1-2283 NaN
64958 None-MYS.8_1-None NaN
64959 None-MYS.9_1-2251 NaN
64960 None-MYS.9_1-None NaN
[243 rows x 2 columns]
string_id rfr_score
aq30_id
34100 444037-MYS.10_1-2171 3.655798
34101 444037-MYS.2_1-2171 3.655798
34102 444037-MYS.3_1-2171 3.655798
34103 444037-MYS.3_1-2217 3.655798
34104 444037-MYS.3_1-None 3.655798
... ... ...
64956 None-MYS.6_1-None NaN
64957 None-MYS.8_1-2283 NaN
64958 None-MYS.8_1-None NaN
64959 None-MYS.9_1-2251 NaN
64960 None-MYS.9_1-None NaN
[243 rows x 2 columns]
string_id cfr_score
aq30_id
34100 444037-MYS.10_1-2171 0.541537
34101 444037-MYS.2_1-2171 0.541537
34102 444037-MYS.3_1-2171 0.541537
34103 444037-MYS.3_1-2217 0.541537
34104 444037-MYS.3_1-None 0.541537
... ... ...
64956 None-MYS.6_1-None NaN
64957 None-MYS.8_1-2283 NaN
64958 None-MYS.8_1-None NaN
64959 None-MYS.9_1-2251 NaN
64960 None-MYS.9_1-None NaN
[243 rows x 2 columns]
string_id drr_score
aq30_id
34100 444037-MYS.10_1-2171 2.857626
34101 444037-MYS.2_1-2171 2.857626
34102 444037-MYS.3_1-2171 2.857626
34103 444037-MYS.3_1-2217 2.857626
34104 444037-MYS.3_1-None 2.857626
... ... ...
64956 None-MYS.6_1-None NaN
64957 None-MYS.8_1-2283 NaN
64958 None-MYS.8_1-None NaN
64959 None-MYS.9_1-2251 NaN
64960 None-MYS.9_1-None NaN
[243 rows x 2 columns]
string_id ucw_score
aq30_id
34100 444037-MYS.10_1-2171 5.0
34101 444037-MYS.2_1-2171 5.0
34102 444037-MYS.3_1-2171 5.0
34103 444037-MYS.3_1-2217 5.0
34104 444037-MYS.3_1-None 5.0
... ... ...
64956 None-MYS.6_1-None 5.0
64957 None-MYS.8_1-2283 5.0
64958 None-MYS.8_1-None 5.0
64959 None-MYS.9_1-2251 5.0
64960 None-MYS.9_1-None 5.0
[243 rows x 2 columns]
string_id cep_score
aq30_id
34100 444037-MYS.10_1-2171 1.122045
34101 444037-MYS.2_1-2171 1.122045
34102 444037-MYS.3_1-2171 1.122045
34103 444037-MYS.3_1-2217 1.122045
34104 444037-MYS.3_1-None 1.122045
... ... ...
64956 None-MYS.6_1-None NaN
64957 None-MYS.8_1-2283 NaN
64958 None-MYS.8_1-None NaN
64959 None-MYS.9_1-2251 NaN
64960 None-MYS.9_1-None NaN
[243 rows x 2 columns]
string_id udw_score
aq30_id
34100 444037-MYS.10_1-2171 0.735103
34101 444037-MYS.2_1-2171 0.735103
34102 444037-MYS.3_1-2171 0.735103
34103 444037-MYS.3_1-2217 0.735103
34104 444037-MYS.3_1-None 0.735103
... ... ...
64956 None-MYS.6_1-None NaN
64957 None-MYS.8_1-2283 NaN
64958 None-MYS.8_1-None NaN
64959 None-MYS.9_1-2251 NaN
64960 None-MYS.9_1-None NaN
[243 rows x 2 columns]
string_id usa_score
aq30_id
34100 444037-MYS.10_1-2171 0.0
34101 444037-MYS.2_1-2171 0.0
34102 444037-MYS.3_1-2171 0.0
34103 444037-MYS.3_1-2217 0.0
34104 444037-MYS.3_1-None 0.0
... ... ...
64956 None-MYS.6_1-None NaN
64957 None-MYS.8_1-2283 NaN
64958 None-MYS.8_1-None NaN
64959 None-MYS.9_1-2251 NaN
64960 None-MYS.9_1-None NaN
[243 rows x 2 columns]
string_id rri_score
aq30_id
34100 444037-MYS.10_1-2171 1.96
34101 444037-MYS.2_1-2171 1.96
34102 444037-MYS.3_1-2171 1.96
34103 444037-MYS.3_1-2217 1.96
34104 444037-MYS.3_1-None 1.96
... ... ...
64956 None-MYS.6_1-None 1.96
64957 None-MYS.8_1-2283 1.96
64958 None-MYS.8_1-None 1.96
64959 None-MYS.9_1-2251 1.96
64960 None-MYS.9_1-None 1.96
[243 rows x 2 columns]
string_id w_awr_def_tot_score
aq30_id
34100 444037-MYS.10_1-2171 1.446886
34101 444037-MYS.2_1-2171 1.446886
34102 444037-MYS.3_1-2171 1.446886
34103 444037-MYS.3_1-2217 1.446886
34104 444037-MYS.3_1-None 1.446886
... ... ...
64956 None-MYS.6_1-None 4.738070
64957 None-MYS.8_1-2283 4.738070
64958 None-MYS.8_1-None 4.738070
64959 None-MYS.9_1-2251 4.738070
64960 None-MYS.9_1-None 4.738070
[243 rows x 2 columns]
# For global reservoirs
df_world_bws = df_basin[['string_id','bws_score']]
print(df_world_bws)
df_world_bwd = df_basin[['string_id','bwd_score']]
print(df_world_bwd)
df_world_iav = df_basin[['string_id','iav_score']]
print(df_world_iav)
df_world_sev = df_basin[['string_id','sev_score']]
print(df_world_sev)
df_world_gtd = df_basin[['string_id','gtd_score']]
print(df_world_gtd)
df_world_rfr = df_basin[['string_id','rfr_score']]
print(df_world_rfr)
df_world_cfr = df_basin[['string_id','cfr_score']]
print(df_world_cfr)
df_world_drr = df_basin[['string_id','drr_score']]
print(df_world_drr)
df_world_ucw = df_basin[['string_id','ucw_score']]
print(df_world_ucw)
df_world_cep = df_basin[['string_id','cep_score']]
print(df_world_cep)
df_world_udw = df_basin[['string_id','udw_score']]
print(df_world_udw)
df_world_usa = df_basin[['string_id','usa_score']]
print(df_world_usa)
df_world_rri = df_basin[['string_id','rri_score']]
print(df_world_rri)
df_world_w_awr_def_tot_score = df_basin[['string_id','w_awr_def_tot_score']]
print(df_world_w_awr_def_tot_score)
string_id bws_score
aq30_id
0 111011-EGY.11_1-3365 5.0
1 111011-EGY.15_1-3365 5.0
2 111011-EGY.15_1-None 5.0
3 111011-None-3365 5.0
4 111011-None-None 5.0
... ... ...
68506 None-YEM.5_1-None NaN
68507 None-ZAF.1_1-None NaN
68508 None-ZAF.4_1-None NaN
68509 None-ZAF.9_1-2940 NaN
68510 None-ZAF.9_1-None NaN
[68506 rows x 2 columns]
string_id bwd_score
aq30_id
0 111011-EGY.11_1-3365 4.948243
1 111011-EGY.15_1-3365 4.948243
2 111011-EGY.15_1-None 4.948243
3 111011-None-3365 4.948243
4 111011-None-None 4.948243
... ... ...
68506 None-YEM.5_1-None NaN
68507 None-ZAF.1_1-None NaN
68508 None-ZAF.4_1-None NaN
68509 None-ZAF.9_1-2940 NaN
68510 None-ZAF.9_1-None NaN
[68506 rows x 2 columns]
string_id iav_score
aq30_id
0 111011-EGY.11_1-3365 4.141657
1 111011-EGY.15_1-3365 4.141657
2 111011-EGY.15_1-None 4.141657
3 111011-None-3365 4.141657
4 111011-None-None 4.141657
... ... ...
68506 None-YEM.5_1-None NaN
68507 None-ZAF.1_1-None NaN
68508 None-ZAF.4_1-None NaN
68509 None-ZAF.9_1-2940 NaN
68510 None-ZAF.9_1-None NaN
[68506 rows x 2 columns]
string_id sev_score
aq30_id
0 111011-EGY.11_1-3365 2.887187
1 111011-EGY.15_1-3365 2.887187
2 111011-EGY.15_1-None 2.887187
3 111011-None-3365 2.887187
4 111011-None-None 2.887187
... ... ...
68506 None-YEM.5_1-None NaN
68507 None-ZAF.1_1-None NaN
68508 None-ZAF.4_1-None NaN
68509 None-ZAF.9_1-2940 NaN
68510 None-ZAF.9_1-None NaN
[68506 rows x 2 columns]
string_id gtd_score
aq30_id
0 111011-EGY.11_1-3365 NaN
1 111011-EGY.15_1-3365 NaN
2 111011-EGY.15_1-None NaN
3 111011-None-3365 NaN
4 111011-None-None NaN
... ... ...
68506 None-YEM.5_1-None NaN
68507 None-ZAF.1_1-None NaN
68508 None-ZAF.4_1-None NaN
68509 None-ZAF.9_1-2940 NaN
68510 None-ZAF.9_1-None NaN
[68506 rows x 2 columns]
string_id rfr_score
aq30_id
0 111011-EGY.11_1-3365 4.180674
1 111011-EGY.15_1-3365 4.180674
2 111011-EGY.15_1-None 4.180674
3 111011-None-3365 4.180674
4 111011-None-None 4.180674
... ... ...
68506 None-YEM.5_1-None NaN
68507 None-ZAF.1_1-None NaN
68508 None-ZAF.4_1-None NaN
68509 None-ZAF.9_1-2940 NaN
68510 None-ZAF.9_1-None NaN
[68506 rows x 2 columns]
string_id cfr_score
aq30_id
0 111011-EGY.11_1-3365 0.0
1 111011-EGY.15_1-3365 0.0
2 111011-EGY.15_1-None 0.0
3 111011-None-3365 0.0
4 111011-None-None 0.0
... ... ...
68506 None-YEM.5_1-None NaN
68507 None-ZAF.1_1-None NaN
68508 None-ZAF.4_1-None NaN
68509 None-ZAF.9_1-2940 NaN
68510 None-ZAF.9_1-None NaN
[68506 rows x 2 columns]
string_id drr_score
aq30_id
0 111011-EGY.11_1-3365 NaN
1 111011-EGY.15_1-3365 NaN
2 111011-EGY.15_1-None NaN
3 111011-None-3365 NaN
4 111011-None-None NaN
... ... ...
68506 None-YEM.5_1-None NaN
68507 None-ZAF.1_1-None NaN
68508 None-ZAF.4_1-None NaN
68509 None-ZAF.9_1-2940 NaN
68510 None-ZAF.9_1-None NaN
[68506 rows x 2 columns]
string_id ucw_score
aq30_id
0 111011-EGY.11_1-3365 2.046333
1 111011-EGY.15_1-3365 2.046333
2 111011-EGY.15_1-None 2.046333
3 111011-None-3365 NaN
4 111011-None-None NaN
... ... ...
68506 None-YEM.5_1-None 3.955055
68507 None-ZAF.1_1-None 2.005333
68508 None-ZAF.4_1-None 2.005333
68509 None-ZAF.9_1-2940 2.005333
68510 None-ZAF.9_1-None 2.005333
[68506 rows x 2 columns]
string_id cep_score
aq30_id
0 111011-EGY.11_1-3365 2.0
1 111011-EGY.15_1-3365 2.0
2 111011-EGY.15_1-None 2.0
3 111011-None-3365 2.0
4 111011-None-None 2.0
... ... ...
68506 None-YEM.5_1-None NaN
68507 None-ZAF.1_1-None NaN
68508 None-ZAF.4_1-None NaN
68509 None-ZAF.9_1-2940 NaN
68510 None-ZAF.9_1-None NaN
[68506 rows x 2 columns]
string_id udw_score
aq30_id
0 111011-EGY.11_1-3365 0.0
1 111011-EGY.15_1-3365 0.0
2 111011-EGY.15_1-None 0.0
3 111011-None-3365 0.0
4 111011-None-None 0.0
... ... ...
68506 None-YEM.5_1-None NaN
68507 None-ZAF.1_1-None NaN
68508 None-ZAF.4_1-None NaN
68509 None-ZAF.9_1-2940 NaN
68510 None-ZAF.9_1-None NaN
[68506 rows x 2 columns]
string_id usa_score
aq30_id
0 111011-EGY.11_1-3365 0.890711
1 111011-EGY.15_1-3365 0.890711
2 111011-EGY.15_1-None 0.890711
3 111011-None-3365 0.890711
4 111011-None-None 0.890711
... ... ...
68506 None-YEM.5_1-None NaN
68507 None-ZAF.1_1-None NaN
68508 None-ZAF.4_1-None NaN
68509 None-ZAF.9_1-2940 NaN
68510 None-ZAF.9_1-None NaN
[68506 rows x 2 columns]
string_id rri_score
aq30_id
0 111011-EGY.11_1-3365 2.80
1 111011-EGY.15_1-3365 2.80
2 111011-EGY.15_1-None 2.80
3 111011-None-3365 NaN
4 111011-None-None NaN
... ... ...
68506 None-YEM.5_1-None 4.60
68507 None-ZAF.1_1-None 1.64
68508 None-ZAF.4_1-None 1.64
68509 None-ZAF.9_1-2940 1.64
68510 None-ZAF.9_1-None 1.64
[68506 rows x 2 columns]
string_id w_awr_def_tot_score
aq30_id
0 111011-EGY.11_1-3365 4.167781
1 111011-EGY.15_1-3365 4.167781
2 111011-EGY.15_1-None 4.167781
3 111011-None-3365 4.226421
4 111011-None-None 4.226421
... ... ...
68506 None-YEM.5_1-None 4.605401
68507 None-ZAF.1_1-None 2.652450
68508 None-ZAF.4_1-None 2.652450
68509 None-ZAF.9_1-2940 2.652450
68510 None-ZAF.9_1-None 2.652450
[68506 rows x 2 columns]
# Importing the reducing library
from functools import reduce
dfl_merge = [df_world_bws, df_world_bwd, df_world_iav, df_world_sev, df_world_gtd, df_world_rfr, df_world_cfr,
df_world_drr, df_world_ucw, df_world_cep, df_world_udw, df_world_usa, df_world_rri,
df_world_w_awr_def_tot_score]
df_reduced_basin = reduce(lambda left,right: pd.merge(left, right , on = ['string_id'], how = 'outer'), dfl_merge)
print(df_reduced_basin)
df_cleaned_basin.to_csv('df_reduced_basin.csv', index = False)
string_id bws_score bwd_score iav_score sev_score \
0 111011-EGY.11_1-3365 5.0 4.948243 4.141657 2.887187
1 111011-EGY.15_1-3365 5.0 4.948243 4.141657 2.887187
2 111011-EGY.15_1-None 5.0 4.948243 4.141657 2.887187
3 111011-None-3365 5.0 4.948243 4.141657 2.887187
4 111011-None-None 5.0 4.948243 4.141657 2.887187
... ... ... ... ... ...
68501 None-YEM.5_1-None NaN NaN NaN NaN
68502 None-ZAF.1_1-None NaN NaN NaN NaN
68503 None-ZAF.4_1-None NaN NaN NaN NaN
68504 None-ZAF.9_1-2940 NaN NaN NaN NaN
68505 None-ZAF.9_1-None NaN NaN NaN NaN
gtd_score rfr_score cfr_score drr_score ucw_score cep_score \
0 NaN 4.180674 0.0 NaN 2.046333 2.0
1 NaN 4.180674 0.0 NaN 2.046333 2.0
2 NaN 4.180674 0.0 NaN 2.046333 2.0
3 NaN 4.180674 0.0 NaN NaN 2.0
4 NaN 4.180674 0.0 NaN NaN 2.0
... ... ... ... ... ... ...
68501 NaN NaN NaN NaN 3.955055 NaN
68502 NaN NaN NaN NaN 2.005333 NaN
68503 NaN NaN NaN NaN 2.005333 NaN
68504 NaN NaN NaN NaN 2.005333 NaN
68505 NaN NaN NaN NaN 2.005333 NaN
udw_score usa_score rri_score w_awr_def_tot_score
0 0.0 0.890711 2.80 4.167781
1 0.0 0.890711 2.80 4.167781
2 0.0 0.890711 2.80 4.167781
3 0.0 0.890711 NaN 4.226421
4 0.0 0.890711 NaN 4.226421
... ... ... ... ...
68501 NaN NaN 4.60 4.605401
68502 NaN NaN 1.64 2.652450
68503 NaN NaN 1.64 2.652450
68504 NaN NaN 1.64 2.652450
68505 NaN NaN 1.64 2.652450
[68506 rows x 15 columns]
# For global reservoirs
ls_clean = []
for i, df in enumerate(dfl_merge):
new_df_name = 'df{}_cleaned'.format(i+1)
globals()[new_df_name] = df.dropna()
if new_df_name not in ls_clean:
ls_clean.append(new_df_name)
print(new_df_name)
print(ls_clean)
df1_cleaned df2_cleaned df3_cleaned df4_cleaned df5_cleaned df6_cleaned df7_cleaned df8_cleaned df9_cleaned df10_cleaned df11_cleaned df12_cleaned df13_cleaned df14_cleaned ['df1_cleaned', 'df2_cleaned', 'df3_cleaned', 'df4_cleaned', 'df5_cleaned', 'df6_cleaned', 'df7_cleaned', 'df8_cleaned', 'df9_cleaned', 'df10_cleaned', 'df11_cleaned', 'df12_cleaned', 'df13_cleaned', 'df14_cleaned']
df_cleaned_names = [df1_cleaned,df2_cleaned,df3_cleaned,df4_cleaned,df5_cleaned,df6_cleaned,df7_cleaned,
df8_cleaned,df9_cleaned,df10_cleaned,df11_cleaned,df12_cleaned,df13_cleaned,df14_cleaned]
dfl_merge_new_name = ['df_world_bws_cleaned', 'df_world_bwd_cleaned', 'df_world_iav_cleaned', 'df_world_sev_cleaned', 'df_world_gtd_cleaned', 'df_world_rfr_cleaned', 'df_world_cfr_cleaned',
'df_world_drr_cleaned', 'df_world_ucw_cleaned', 'df_world_cep_cleaned', 'df_world_udw_cleaned', 'df_world_usa_cleaned', 'df_world_rri_cleaned',
'df_world_w_awr_def_tot_score_cleaned']
new_clean_risk = []
for i,df in enumerate(df_cleaned_names):
new_df_name = dfl_merge_new_name[i]
new_df = df.copy()
globals()[new_df_name] = new_df
new_clean_risk.append(new_df)
print(new_clean_risk)
[ string_id bws_score aq30_id 0 111011-EGY.11_1-3365 5.0 1 111011-EGY.15_1-3365 5.0 2 111011-EGY.15_1-None 5.0 3 111011-None-3365 5.0 4 111011-None-None 5.0 ... ... ... 63346 914900-None-17 5.0 63347 914900-None-19 5.0 63348 914900-None-21 5.0 63349 914900-None-26 5.0 63350 914900-None-None 5.0 [62900 rows x 2 columns], string_id bwd_score aq30_id 0 111011-EGY.11_1-3365 4.948243 1 111011-EGY.15_1-3365 4.948243 2 111011-EGY.15_1-None 4.948243 3 111011-None-3365 4.948243 4 111011-None-None 4.948243 ... ... ... 63346 914900-None-17 5.000000 63347 914900-None-19 5.000000 63348 914900-None-21 5.000000 63349 914900-None-26 5.000000 63350 914900-None-None 5.000000 [62900 rows x 2 columns], string_id iav_score aq30_id 0 111011-EGY.11_1-3365 4.141657 1 111011-EGY.15_1-3365 4.141657 2 111011-EGY.15_1-None 4.141657 3 111011-None-3365 4.141657 4 111011-None-None 4.141657 ... ... ... 62040 863099-CAN.8_1-9 2.905139 62041 863099-CAN.8_1-None 2.905139 62042 863099-None-222 2.905139 62043 863099-None-9 2.905139 62044 863099-None-None 2.905139 [61490 rows x 2 columns], string_id sev_score aq30_id 0 111011-EGY.11_1-3365 2.887187 1 111011-EGY.15_1-3365 2.887187 2 111011-EGY.15_1-None 2.887187 3 111011-None-3365 2.887187 4 111011-None-None 2.887187 ... ... ... 62040 863099-CAN.8_1-9 1.235802 62041 863099-CAN.8_1-None 1.235802 62042 863099-None-222 1.235802 62043 863099-None-9 1.235802 62044 863099-None-None 1.235802 [61490 rows x 2 columns], string_id gtd_score aq30_id 17 111014-EGY.2_1-1732 1.028479 23 111015-EGY.10_1-1732 1.028479 25 111015-EGY.2_1-1732 1.028479 26 111015-EGY.2_1-1775 1.162628 29 111015-None-1775 1.162628 ... ... ... 68390 None-USA.44_1-1400 1.602974 68392 None-USA.44_1-1722 1.188728 68396 None-USA.47_1-1400 1.602974 68406 None-USA.8_1-1400 1.602974 68434 None-VNM.25_1-2122 2.165285 [8266 rows x 2 columns], string_id rfr_score aq30_id 0 111011-EGY.11_1-3365 4.180674 1 111011-EGY.15_1-3365 4.180674 2 111011-EGY.15_1-None 4.180674 3 111011-None-3365 4.180674 4 111011-None-None 4.180674 ... ... ... 63346 914900-None-17 0.000000 63347 914900-None-19 0.000000 63348 914900-None-21 0.000000 63349 914900-None-26 0.000000 63350 914900-None-None 0.000000 [63345 rows x 2 columns], string_id cfr_score aq30_id 0 111011-EGY.11_1-3365 0.0 1 111011-EGY.15_1-3365 0.0 2 111011-EGY.15_1-None 0.0 3 111011-None-3365 0.0 4 111011-None-None 0.0 ... ... ... 63346 914900-None-17 0.0 63347 914900-None-19 0.0 63348 914900-None-21 0.0 63349 914900-None-26 0.0 63350 914900-None-None 0.0 [63345 rows x 2 columns], string_id drr_score aq30_id 89 111081-ERI.2_1-3365 2.215140 90 111081-ERI.6_1-3365 2.215140 91 111081-None-None 2.215140 92 111081-SDN.11_1-1775 2.215140 93 111081-SDN.11_1-1930 2.215140 ... ... ... 61194 832808-CAN.3_1-352 1.512279 61195 832808-CAN.8_1-352 1.512279 61196 832809-CAN.12_1-352 1.473108 61197 832809-CAN.3_1-352 1.473108 61198 832809-CAN.8_1-352 1.473108 [52607 rows x 2 columns], string_id ucw_score aq30_id 0 111011-EGY.11_1-3365 2.046333 1 111011-EGY.15_1-3365 2.046333 2 111011-EGY.15_1-None 2.046333 5 111012-EGY.11_1-3365 2.046333 6 111012-EGY.15_1-3365 2.046333 ... ... ... 68506 None-YEM.5_1-None 3.955055 68507 None-ZAF.1_1-None 2.005333 68508 None-ZAF.4_1-None 2.005333 68509 None-ZAF.9_1-2940 2.005333 68510 None-ZAF.9_1-None 2.005333 [57699 rows x 2 columns], string_id cep_score aq30_id 0 111011-EGY.11_1-3365 2.000000 1 111011-EGY.15_1-3365 2.000000 2 111011-EGY.15_1-None 2.000000 3 111011-None-3365 2.000000 4 111011-None-None 2.000000 ... ... ... 63346 914900-None-17 1.983534 63347 914900-None-19 1.983534 63348 914900-None-21 1.983534 63349 914900-None-26 1.983534 63350 914900-None-None 1.983534 [62086 rows x 2 columns], string_id udw_score aq30_id 0 111011-EGY.11_1-3365 0.0 1 111011-EGY.15_1-3365 0.0 2 111011-EGY.15_1-None 0.0 3 111011-None-3365 0.0 4 111011-None-None 0.0 ... ... ... 63346 914900-None-17 0.0 63347 914900-None-19 0.0 63348 914900-None-21 0.0 63349 914900-None-26 0.0 63350 914900-None-None 0.0 [63347 rows x 2 columns], string_id usa_score aq30_id 0 111011-EGY.11_1-3365 0.890711 1 111011-EGY.15_1-3365 0.890711 2 111011-EGY.15_1-None 0.890711 3 111011-None-3365 0.890711 4 111011-None-None 0.890711 ... ... ... 63346 914900-None-17 0.000000 63347 914900-None-19 0.000000 63348 914900-None-21 0.000000 63349 914900-None-26 0.000000 63350 914900-None-None 0.000000 [63347 rows x 2 columns], string_id rri_score aq30_id 0 111011-EGY.11_1-3365 2.80 1 111011-EGY.15_1-3365 2.80 2 111011-EGY.15_1-None 2.80 5 111012-EGY.11_1-3365 2.80 6 111012-EGY.15_1-3365 2.80 ... ... ... 68506 None-YEM.5_1-None 4.60 68507 None-ZAF.1_1-None 1.64 68508 None-ZAF.4_1-None 1.64 68509 None-ZAF.9_1-2940 1.64 68510 None-ZAF.9_1-None 1.64 [57890 rows x 2 columns], string_id w_awr_def_tot_score aq30_id 0 111011-EGY.11_1-3365 4.167781 1 111011-EGY.15_1-3365 4.167781 2 111011-EGY.15_1-None 4.167781 3 111011-None-3365 4.226421 4 111011-None-None 4.226421 ... ... ... 68506 None-YEM.5_1-None 4.605401 68507 None-ZAF.1_1-None 2.652450 68508 None-ZAF.4_1-None 2.652450 68509 None-ZAF.9_1-2940 2.652450 68510 None-ZAF.9_1-None 2.652450 [66004 rows x 2 columns]]
for var_name in locals():
if var_name in dfl_merge_new_name:
print(var_name)
df_world_bws_cleaned df_world_bwd_cleaned df_world_iav_cleaned df_world_sev_cleaned df_world_gtd_cleaned df_world_rfr_cleaned df_world_cfr_cleaned df_world_drr_cleaned df_world_ucw_cleaned df_world_cep_cleaned df_world_udw_cleaned df_world_usa_cleaned df_world_rri_cleaned df_world_w_awr_def_tot_score_cleaned
# Getting the cleaned risk scores for Malaysian reservoirs only
MY_partial_identifier = 'MYS'
df_MY_bws_clean = df_world_bws_cleaned[df_world_bws_cleaned['string_id'].str.contains(MY_partial_identifier)]
print(df_MY_bws_clean)
df_MY_bwd_clean = df_world_bwd_cleaned[df_world_bwd_cleaned['string_id'].str.contains(MY_partial_identifier)]
print(df_MY_bwd_clean)
df_MY_iav_clean = df_world_iav_cleaned[df_world_iav_cleaned['string_id'].str.contains(MY_partial_identifier)]
print(df_MY_iav_clean)
df_MY_sev_clean = df_world_sev_cleaned[df_world_sev_cleaned['string_id'].str.contains(MY_partial_identifier)]
print(df_MY_sev_clean)
df_MY_gtd_clean = df_world_gtd_cleaned[df_world_gtd_cleaned['string_id'].str.contains(MY_partial_identifier)]
print(df_MY_gtd_clean)
df_MY_rfr_clean = df_world_rfr_cleaned[df_world_rfr_cleaned['string_id'].str.contains(MY_partial_identifier)]
print(df_MY_rfr_clean)
df_MY_cfr_clean = df_world_cfr_cleaned[df_world_cfr_cleaned['string_id'].str.contains(MY_partial_identifier)]
print(df_MY_cfr_clean)
df_MY_drr_clean = df_world_drr_cleaned[df_world_drr_cleaned['string_id'].str.contains(MY_partial_identifier)]
print(df_MY_drr_clean)
df_MY_ucw_clean = df_world_ucw_cleaned[df_world_ucw_cleaned['string_id'].str.contains(MY_partial_identifier)]
print(df_MY_ucw_clean)
df_MY_cep_clean = df_world_cep_cleaned[df_world_cep_cleaned['string_id'].str.contains(MY_partial_identifier)]
print(df_MY_cep_clean)
df_MY_udw_clean = df_world_udw_cleaned[df_world_udw_cleaned['string_id'].str.contains(MY_partial_identifier)]
print(df_MY_udw_clean)
df_MY_usa_clean = df_world_usa_cleaned[df_world_usa_cleaned['string_id'].str.contains(MY_partial_identifier)]
print(df_MY_usa_clean)
df_MY_rri_clean = df_world_rri_cleaned[df_world_rri_cleaned['string_id'].str.contains(MY_partial_identifier)]
print(df_MY_rri_clean)
df_MY_w_awr_def_tot_score_clean = df_world_w_awr_def_tot_score_cleaned[df_world_w_awr_def_tot_score_cleaned['string_id'].str.contains(MY_partial_identifier)]
print(df_MY_w_awr_def_tot_score_clean)
string_id bws_score
aq30_id
34100 444037-MYS.10_1-2171 0.0
34101 444037-MYS.2_1-2171 0.0
34102 444037-MYS.3_1-2171 0.0
34103 444037-MYS.3_1-2217 0.0
34104 444037-MYS.3_1-None 0.0
... ... ...
40038 521760-MYS.14_1-2222 0.0
40055 521780-MYS.14_1-2222 0.0
40119 521808-MYS.14_1-2222 0.0
40125 521809-MYS.14_1-2222 0.0
40194 524010-MYS.13_1-None 0.0
[221 rows x 2 columns]
string_id bwd_score
aq30_id
34100 444037-MYS.10_1-2171 0.280569
34101 444037-MYS.2_1-2171 0.280569
34102 444037-MYS.3_1-2171 0.280569
34103 444037-MYS.3_1-2217 0.280569
34104 444037-MYS.3_1-None 0.280569
... ... ...
40038 521760-MYS.14_1-2222 0.001385
40055 521780-MYS.14_1-2222 0.001329
40119 521808-MYS.14_1-2222 0.000664
40125 521809-MYS.14_1-2222 0.000547
40194 524010-MYS.13_1-None 0.463100
[221 rows x 2 columns]
string_id iav_score
aq30_id
34100 444037-MYS.10_1-2171 1.289226
34101 444037-MYS.2_1-2171 1.289226
34102 444037-MYS.3_1-2171 1.289226
34103 444037-MYS.3_1-2217 1.289226
34104 444037-MYS.3_1-None 1.289226
... ... ...
40038 521760-MYS.14_1-2222 1.262480
40055 521780-MYS.14_1-2222 1.191458
40119 521808-MYS.14_1-2222 0.924765
40125 521809-MYS.14_1-2222 0.927093
40194 524010-MYS.13_1-None 3.269535
[221 rows x 2 columns]
string_id sev_score
aq30_id
34100 444037-MYS.10_1-2171 1.939940
34101 444037-MYS.2_1-2171 1.939940
34102 444037-MYS.3_1-2171 1.939940
34103 444037-MYS.3_1-2217 1.939940
34104 444037-MYS.3_1-None 1.939940
... ... ...
40038 521760-MYS.14_1-2222 0.306158
40055 521780-MYS.14_1-2222 0.347437
40119 521808-MYS.14_1-2222 0.635348
40125 521809-MYS.14_1-2222 0.638737
40194 524010-MYS.13_1-None 0.672260
[221 rows x 2 columns]
Empty DataFrame
Columns: [string_id, gtd_score]
Index: []
string_id rfr_score
aq30_id
34100 444037-MYS.10_1-2171 3.655798
34101 444037-MYS.2_1-2171 3.655798
34102 444037-MYS.3_1-2171 3.655798
34103 444037-MYS.3_1-2217 3.655798
34104 444037-MYS.3_1-None 3.655798
... ... ...
40038 521760-MYS.14_1-2222 4.141351
40055 521780-MYS.14_1-2222 4.142712
40119 521808-MYS.14_1-2222 1.562173
40125 521809-MYS.14_1-2222 2.959756
40194 524010-MYS.13_1-None 1.651099
[221 rows x 2 columns]
string_id cfr_score
aq30_id
34100 444037-MYS.10_1-2171 0.541537
34101 444037-MYS.2_1-2171 0.541537
34102 444037-MYS.3_1-2171 0.541537
34103 444037-MYS.3_1-2217 0.541537
34104 444037-MYS.3_1-None 0.541537
... ... ...
40038 521760-MYS.14_1-2222 4.017283
40055 521780-MYS.14_1-2222 0.000000
40119 521808-MYS.14_1-2222 0.000000
40125 521809-MYS.14_1-2222 0.000000
40194 524010-MYS.13_1-None 4.007097
[221 rows x 2 columns]
string_id drr_score
aq30_id
34100 444037-MYS.10_1-2171 2.857626
34101 444037-MYS.2_1-2171 2.857626
34102 444037-MYS.3_1-2171 2.857626
34103 444037-MYS.3_1-2217 2.857626
34104 444037-MYS.3_1-None 2.857626
... ... ...
40038 521760-MYS.14_1-2222 2.744894
40055 521780-MYS.14_1-2222 2.637014
40119 521808-MYS.14_1-2222 2.540840
40125 521809-MYS.14_1-2222 2.598949
40194 524010-MYS.13_1-None 3.317998
[221 rows x 2 columns]
string_id ucw_score
aq30_id
34100 444037-MYS.10_1-2171 5.0
34101 444037-MYS.2_1-2171 5.0
34102 444037-MYS.3_1-2171 5.0
34103 444037-MYS.3_1-2217 5.0
34104 444037-MYS.3_1-None 5.0
... ... ...
64956 None-MYS.6_1-None 5.0
64957 None-MYS.8_1-2283 5.0
64958 None-MYS.8_1-None 5.0
64959 None-MYS.9_1-2251 5.0
64960 None-MYS.9_1-None 5.0
[243 rows x 2 columns]
string_id cep_score
aq30_id
34100 444037-MYS.10_1-2171 1.122045
34101 444037-MYS.2_1-2171 1.122045
34102 444037-MYS.3_1-2171 1.122045
34103 444037-MYS.3_1-2217 1.122045
34104 444037-MYS.3_1-None 1.122045
... ... ...
40037 521760-MYS.13_1-2222 0.839217
40038 521760-MYS.14_1-2222 0.839217
40055 521780-MYS.14_1-2222 0.831462
40119 521808-MYS.14_1-2222 0.860604
40125 521809-MYS.14_1-2222 0.875725
[218 rows x 2 columns]
string_id udw_score
aq30_id
34100 444037-MYS.10_1-2171 0.735103
34101 444037-MYS.2_1-2171 0.735103
34102 444037-MYS.3_1-2171 0.735103
34103 444037-MYS.3_1-2217 0.735103
34104 444037-MYS.3_1-None 0.735103
... ... ...
40038 521760-MYS.14_1-2222 3.726737
40055 521780-MYS.14_1-2222 3.245690
40119 521808-MYS.14_1-2222 3.817542
40125 521809-MYS.14_1-2222 3.816697
40194 524010-MYS.13_1-None 3.037588
[221 rows x 2 columns]
string_id usa_score
aq30_id
34100 444037-MYS.10_1-2171 0.000000
34101 444037-MYS.2_1-2171 0.000000
34102 444037-MYS.3_1-2171 0.000000
34103 444037-MYS.3_1-2217 0.000000
34104 444037-MYS.3_1-None 0.000000
... ... ...
40038 521760-MYS.14_1-2222 4.421503
40055 521780-MYS.14_1-2222 4.002440
40119 521808-MYS.14_1-2222 4.514201
40125 521809-MYS.14_1-2222 4.511235
40194 524010-MYS.13_1-None 3.174109
[221 rows x 2 columns]
string_id rri_score
aq30_id
34100 444037-MYS.10_1-2171 1.96
34101 444037-MYS.2_1-2171 1.96
34102 444037-MYS.3_1-2171 1.96
34103 444037-MYS.3_1-2217 1.96
34104 444037-MYS.3_1-None 1.96
... ... ...
64956 None-MYS.6_1-None 1.96
64957 None-MYS.8_1-2283 1.96
64958 None-MYS.8_1-None 1.96
64959 None-MYS.9_1-2251 1.96
64960 None-MYS.9_1-None 1.96
[243 rows x 2 columns]
string_id w_awr_def_tot_score
aq30_id
34100 444037-MYS.10_1-2171 1.446886
34101 444037-MYS.2_1-2171 1.446886
34102 444037-MYS.3_1-2171 1.446886
34103 444037-MYS.3_1-2217 1.446886
34104 444037-MYS.3_1-None 1.446886
... ... ...
64956 None-MYS.6_1-None 4.738070
64957 None-MYS.8_1-2283 4.738070
64958 None-MYS.8_1-None 4.738070
64959 None-MYS.9_1-2251 4.738070
64960 None-MYS.9_1-None 4.738070
[243 rows x 2 columns]
Based on Github documentation of the dataset (See here: https://github.com/wri/aqueduct30_data_download/blob/master/metadata.md), it is stated that w_awr (weighted aggregated water risk) is the aggregation of all 13 indicators. We can observe how each indicator correlate with w_awr using scatter plot.
# For bws vs awr (1)
df_MY_bws_awr = pd.merge(df_MY_bws_clean, df_MY_w_awr_def_tot_score_clean, on='string_id')
df_MY_bws_awr = df_MY_bws_awr.dropna()
print(df_MY_bws_awr)
# For bwd vs awr (2)
df_MY_bwd_awr = pd.merge(df_MY_bwd_clean, df_MY_w_awr_def_tot_score_clean, on='string_id')
df_MY_bwd_awr = df_MY_bwd_awr.dropna()
print(df_MY_bwd_awr)
# For iav vs awr (3)
df_MY_iav_awr = pd.merge(df_MY_iav_clean, df_MY_w_awr_def_tot_score_clean, on='string_id')
df_MY_iav_awr = df_MY_iav_awr.dropna()
print(df_MY_iav_awr)
# For sev vs awr (4)
df_MY_sev_awr = pd.merge(df_MY_sev_clean, df_MY_w_awr_def_tot_score_clean, on='string_id')
df_MY_sev_awr = df_MY_sev_awr.dropna()
print(df_MY_sev_awr)
# For gtd vs awr (5)
df_MY_gtd_awr = pd.merge(df_MY_gtd_clean, df_MY_w_awr_def_tot_score_clean, on='string_id')
df_MY_gtd_awr = df_MY_gtd_awr.dropna()
print(df_MY_gtd_awr)
# For rfr vs awr (6)
df_MY_rfr_awr = pd.merge(df_MY_rfr_clean, df_MY_w_awr_def_tot_score_clean, on='string_id')
df_MY_rfr_awr = df_MY_rfr_awr.dropna()
print(df_MY_rfr_awr)
# For cfr vs awr (7)
df_MY_cfr_awr = pd.merge(df_MY_cfr_clean, df_MY_w_awr_def_tot_score_clean, on='string_id')
df_MY_cfr_awr = df_MY_cfr_awr.dropna()
print(df_MY_cfr_awr)
# For drr vs awr (8)
df_MY_drr_awr = pd.merge(df_MY_drr_clean, df_MY_w_awr_def_tot_score_clean, on='string_id')
df_MY_drr_awr = df_MY_drr_awr.dropna()
print(df_MY_drr_awr)
# For ucw vs awr (9)
df_MY_ucw_awr = pd.merge(df_MY_ucw_clean, df_MY_w_awr_def_tot_score_clean, on='string_id')
df_MY_ucw_awr = df_MY_ucw_awr.dropna()
print(df_MY_ucw_awr)
# For cep vs awr (10)
df_MY_cep_awr = pd.merge(df_MY_cep_clean, df_MY_w_awr_def_tot_score_clean, on='string_id')
df_MY_cep_awr = df_MY_cep_awr.dropna()
print(df_MY_cep_awr)
# For udw vs awr (11)
df_MY_udw_awr = pd.merge(df_MY_udw_clean, df_MY_w_awr_def_tot_score_clean, on='string_id')
df_MY_udw_awr = df_MY_udw_awr.dropna()
print(df_MY_udw_awr)
# For usa vs awr (12)
df_MY_usa_awr = pd.merge(df_MY_usa_clean, df_MY_w_awr_def_tot_score_clean, on='string_id')
df_MY_usa_awr = df_MY_usa_awr.dropna()
print(df_MY_usa_awr)
# For rri vs awr (13)
df_MY_rri_awr = pd.merge(df_MY_rri_clean, df_MY_w_awr_def_tot_score_clean, on='string_id')
df_MY_rri_awr = df_MY_rri_awr.dropna()
print(df_MY_rri_awr)
string_id bws_score w_awr_def_tot_score
0 444037-MYS.10_1-2171 0.0 1.446886
1 444037-MYS.2_1-2171 0.0 1.446886
2 444037-MYS.3_1-2171 0.0 1.446886
3 444037-MYS.3_1-2217 0.0 1.446886
4 444037-MYS.3_1-None 0.0 1.446886
.. ... ... ...
216 521760-MYS.14_1-2222 0.0 2.943302
217 521780-MYS.14_1-2222 0.0 2.344822
218 521808-MYS.14_1-2222 0.0 2.288622
219 521809-MYS.14_1-2222 0.0 2.439230
220 524010-MYS.13_1-None 0.0 2.843826
[221 rows x 3 columns]
string_id bwd_score w_awr_def_tot_score
0 444037-MYS.10_1-2171 0.280569 1.446886
1 444037-MYS.2_1-2171 0.280569 1.446886
2 444037-MYS.3_1-2171 0.280569 1.446886
3 444037-MYS.3_1-2217 0.280569 1.446886
4 444037-MYS.3_1-None 0.280569 1.446886
.. ... ... ...
216 521760-MYS.14_1-2222 0.001385 2.943302
217 521780-MYS.14_1-2222 0.001329 2.344822
218 521808-MYS.14_1-2222 0.000664 2.288622
219 521809-MYS.14_1-2222 0.000547 2.439230
220 524010-MYS.13_1-None 0.463100 2.843826
[221 rows x 3 columns]
string_id iav_score w_awr_def_tot_score
0 444037-MYS.10_1-2171 1.289226 1.446886
1 444037-MYS.2_1-2171 1.289226 1.446886
2 444037-MYS.3_1-2171 1.289226 1.446886
3 444037-MYS.3_1-2217 1.289226 1.446886
4 444037-MYS.3_1-None 1.289226 1.446886
.. ... ... ...
216 521760-MYS.14_1-2222 1.262480 2.943302
217 521780-MYS.14_1-2222 1.191458 2.344822
218 521808-MYS.14_1-2222 0.924765 2.288622
219 521809-MYS.14_1-2222 0.927093 2.439230
220 524010-MYS.13_1-None 3.269535 2.843826
[221 rows x 3 columns]
string_id sev_score w_awr_def_tot_score
0 444037-MYS.10_1-2171 1.939940 1.446886
1 444037-MYS.2_1-2171 1.939940 1.446886
2 444037-MYS.3_1-2171 1.939940 1.446886
3 444037-MYS.3_1-2217 1.939940 1.446886
4 444037-MYS.3_1-None 1.939940 1.446886
.. ... ... ...
216 521760-MYS.14_1-2222 0.306158 2.943302
217 521780-MYS.14_1-2222 0.347437 2.344822
218 521808-MYS.14_1-2222 0.635348 2.288622
219 521809-MYS.14_1-2222 0.638737 2.439230
220 524010-MYS.13_1-None 0.672260 2.843826
[221 rows x 3 columns]
Empty DataFrame
Columns: [gtd_score, string_id, w_awr_def_tot_score]
Index: []
string_id rfr_score w_awr_def_tot_score
0 444037-MYS.10_1-2171 3.655798 1.446886
1 444037-MYS.2_1-2171 3.655798 1.446886
2 444037-MYS.3_1-2171 3.655798 1.446886
3 444037-MYS.3_1-2217 3.655798 1.446886
4 444037-MYS.3_1-None 3.655798 1.446886
.. ... ... ...
216 521760-MYS.14_1-2222 4.141351 2.943302
217 521780-MYS.14_1-2222 4.142712 2.344822
218 521808-MYS.14_1-2222 1.562173 2.288622
219 521809-MYS.14_1-2222 2.959756 2.439230
220 524010-MYS.13_1-None 1.651099 2.843826
[221 rows x 3 columns]
string_id cfr_score w_awr_def_tot_score
0 444037-MYS.10_1-2171 0.541537 1.446886
1 444037-MYS.2_1-2171 0.541537 1.446886
2 444037-MYS.3_1-2171 0.541537 1.446886
3 444037-MYS.3_1-2217 0.541537 1.446886
4 444037-MYS.3_1-None 0.541537 1.446886
.. ... ... ...
216 521760-MYS.14_1-2222 4.017283 2.943302
217 521780-MYS.14_1-2222 0.000000 2.344822
218 521808-MYS.14_1-2222 0.000000 2.288622
219 521809-MYS.14_1-2222 0.000000 2.439230
220 524010-MYS.13_1-None 4.007097 2.843826
[221 rows x 3 columns]
string_id drr_score w_awr_def_tot_score
0 444037-MYS.10_1-2171 2.857626 1.446886
1 444037-MYS.2_1-2171 2.857626 1.446886
2 444037-MYS.3_1-2171 2.857626 1.446886
3 444037-MYS.3_1-2217 2.857626 1.446886
4 444037-MYS.3_1-None 2.857626 1.446886
.. ... ... ...
216 521760-MYS.14_1-2222 2.744894 2.943302
217 521780-MYS.14_1-2222 2.637014 2.344822
218 521808-MYS.14_1-2222 2.540840 2.288622
219 521809-MYS.14_1-2222 2.598949 2.439230
220 524010-MYS.13_1-None 3.317998 2.843826
[221 rows x 3 columns]
string_id ucw_score w_awr_def_tot_score
0 444037-MYS.10_1-2171 5.0 1.446886
1 444037-MYS.2_1-2171 5.0 1.446886
2 444037-MYS.3_1-2171 5.0 1.446886
3 444037-MYS.3_1-2217 5.0 1.446886
4 444037-MYS.3_1-None 5.0 1.446886
.. ... ... ...
238 None-MYS.6_1-None 5.0 4.738070
239 None-MYS.8_1-2283 5.0 4.738070
240 None-MYS.8_1-None 5.0 4.738070
241 None-MYS.9_1-2251 5.0 4.738070
242 None-MYS.9_1-None 5.0 4.738070
[243 rows x 3 columns]
string_id cep_score w_awr_def_tot_score
0 444037-MYS.10_1-2171 1.122045 1.446886
1 444037-MYS.2_1-2171 1.122045 1.446886
2 444037-MYS.3_1-2171 1.122045 1.446886
3 444037-MYS.3_1-2217 1.122045 1.446886
4 444037-MYS.3_1-None 1.122045 1.446886
.. ... ... ...
213 521760-MYS.13_1-2222 0.839217 2.943302
214 521760-MYS.14_1-2222 0.839217 2.943302
215 521780-MYS.14_1-2222 0.831462 2.344822
216 521808-MYS.14_1-2222 0.860604 2.288622
217 521809-MYS.14_1-2222 0.875725 2.439230
[218 rows x 3 columns]
string_id udw_score w_awr_def_tot_score
0 444037-MYS.10_1-2171 0.735103 1.446886
1 444037-MYS.2_1-2171 0.735103 1.446886
2 444037-MYS.3_1-2171 0.735103 1.446886
3 444037-MYS.3_1-2217 0.735103 1.446886
4 444037-MYS.3_1-None 0.735103 1.446886
.. ... ... ...
216 521760-MYS.14_1-2222 3.726737 2.943302
217 521780-MYS.14_1-2222 3.245690 2.344822
218 521808-MYS.14_1-2222 3.817542 2.288622
219 521809-MYS.14_1-2222 3.816697 2.439230
220 524010-MYS.13_1-None 3.037588 2.843826
[221 rows x 3 columns]
string_id usa_score w_awr_def_tot_score
0 444037-MYS.10_1-2171 0.000000 1.446886
1 444037-MYS.2_1-2171 0.000000 1.446886
2 444037-MYS.3_1-2171 0.000000 1.446886
3 444037-MYS.3_1-2217 0.000000 1.446886
4 444037-MYS.3_1-None 0.000000 1.446886
.. ... ... ...
216 521760-MYS.14_1-2222 4.421503 2.943302
217 521780-MYS.14_1-2222 4.002440 2.344822
218 521808-MYS.14_1-2222 4.514201 2.288622
219 521809-MYS.14_1-2222 4.511235 2.439230
220 524010-MYS.13_1-None 3.174109 2.843826
[221 rows x 3 columns]
string_id rri_score w_awr_def_tot_score
0 444037-MYS.10_1-2171 1.96 1.446886
1 444037-MYS.2_1-2171 1.96 1.446886
2 444037-MYS.3_1-2171 1.96 1.446886
3 444037-MYS.3_1-2217 1.96 1.446886
4 444037-MYS.3_1-None 1.96 1.446886
.. ... ... ...
238 None-MYS.6_1-None 1.96 4.738070
239 None-MYS.8_1-2283 1.96 4.738070
240 None-MYS.8_1-None 1.96 4.738070
241 None-MYS.9_1-2251 1.96 4.738070
242 None-MYS.9_1-None 1.96 4.738070
[243 rows x 3 columns]
# For bws vs awr (1)
df_world_bws_awr = pd.merge(df_world_bws_cleaned, df_world_w_awr_def_tot_score_cleaned, on='string_id')
df_world_bws_awr = df_world_bws_awr.dropna()
print(df_world_bws_awr)
# For bwd vs awr (2)
df_world_bwd_awr = pd.merge(df_world_bwd_cleaned, df_world_w_awr_def_tot_score_cleaned, on='string_id')
df_world_bwd_awr = df_world_bwd_awr.dropna()
print(df_world_bwd_awr)
# For iav vs awr (3)
df_world_iav_awr = pd.merge(df_world_iav_cleaned, df_world_w_awr_def_tot_score_cleaned, on='string_id')
df_world_iav_awr = df_world_iav_awr.dropna()
print(df_world_iav_awr)
# For sev vs awr (4)
df_world_sev_awr = pd.merge(df_world_sev_cleaned, df_world_w_awr_def_tot_score_cleaned, on='string_id')
df_world_sev_awr = df_world_sev_awr.dropna()
print(df_world_sev_awr)
# For gtd vs awr (5)
df_world_gtd_awr = pd.merge(df_world_gtd_cleaned, df_world_w_awr_def_tot_score_cleaned, on='string_id')
df_world_gtd_awr = df_world_gtd_awr.dropna()
print(df_world_gtd_awr)
# For rfr vs awr (6)
df_world_rfr_awr = pd.merge(df_world_rfr_cleaned, df_world_w_awr_def_tot_score_cleaned, on='string_id')
df_world_rfr_awr = df_world_rfr_awr.dropna()
print(df_world_rfr_awr)
# For cfr vs awr (7)
df_world_cfr_awr = pd.merge(df_world_cfr_cleaned, df_world_w_awr_def_tot_score_cleaned, on='string_id')
df_world_cfr_awr = df_world_cfr_awr.dropna()
print(df_world_cfr_awr)
# For drr vs awr (8)
df_world_drr_awr = pd.merge(df_world_drr_cleaned, df_world_w_awr_def_tot_score_cleaned, on='string_id')
df_world_drr_awr = df_world_drr_awr.dropna()
print(df_world_drr_awr)
# For ucw vs awr (9)
df_world_ucw_awr = pd.merge(df_world_ucw_cleaned, df_world_w_awr_def_tot_score_cleaned, on='string_id')
df_world_ucw_awr = df_world_ucw_awr.dropna()
print(df_world_ucw_awr)
# For cep vs awr (10)
df_world_cep_awr = pd.merge(df_world_cep_cleaned, df_world_w_awr_def_tot_score_cleaned, on='string_id')
df_world_cep_awr = df_world_cep_awr.dropna()
print(df_world_cep_awr)
# For udw vs awr (11)
df_world_udw_awr = pd.merge(df_world_udw_cleaned, df_world_w_awr_def_tot_score_cleaned, on='string_id')
df_world_udw_awr = df_world_udw_awr.dropna()
print(df_world_udw_awr)
# For usa vs awr (12)
df_world_usa_awr = pd.merge(df_world_usa_cleaned, df_world_w_awr_def_tot_score_cleaned, on='string_id')
df_world_usa_awr = df_world_usa_awr.dropna()
print(df_world_usa_awr)
# For rri vs awr (13)
df_world_rri_awr = pd.merge(df_world_rri_cleaned, df_world_w_awr_def_tot_score_cleaned, on='string_id')
df_world_rri_awr = df_world_rri_awr.dropna()
print(df_world_rri_awr)
string_id bws_score w_awr_def_tot_score
0 111011-EGY.11_1-3365 5.0 4.167781
1 111011-EGY.15_1-3365 5.0 4.167781
2 111011-EGY.15_1-None 5.0 4.167781
3 111011-None-3365 5.0 4.226421
4 111011-None-None 5.0 4.226421
... ... ... ...
62895 914900-None-17 5.0 4.051755
62896 914900-None-19 5.0 4.051755
62897 914900-None-21 5.0 4.051755
62898 914900-None-26 5.0 4.051755
62899 914900-None-None 5.0 4.051755
[62900 rows x 3 columns]
string_id bwd_score w_awr_def_tot_score
0 111011-EGY.11_1-3365 4.948243 4.167781
1 111011-EGY.15_1-3365 4.948243 4.167781
2 111011-EGY.15_1-None 4.948243 4.167781
3 111011-None-3365 4.948243 4.226421
4 111011-None-None 4.948243 4.226421
... ... ... ...
62895 914900-None-17 5.000000 4.051755
62896 914900-None-19 5.000000 4.051755
62897 914900-None-21 5.000000 4.051755
62898 914900-None-26 5.000000 4.051755
62899 914900-None-None 5.000000 4.051755
[62900 rows x 3 columns]
string_id iav_score w_awr_def_tot_score
0 111011-EGY.11_1-3365 4.141657 4.167781
1 111011-EGY.15_1-3365 4.141657 4.167781
2 111011-EGY.15_1-None 4.141657 4.167781
3 111011-None-3365 4.141657 4.226421
4 111011-None-None 4.141657 4.226421
... ... ... ...
61485 863099-CAN.8_1-9 2.905139 0.448933
61486 863099-CAN.8_1-None 2.905139 0.373875
61487 863099-None-222 2.905139 0.327702
61488 863099-None-9 2.905139 0.421377
61489 863099-None-None 2.905139 0.327702
[61490 rows x 3 columns]
string_id sev_score w_awr_def_tot_score
0 111011-EGY.11_1-3365 2.887187 4.167781
1 111011-EGY.15_1-3365 2.887187 4.167781
2 111011-EGY.15_1-None 2.887187 4.167781
3 111011-None-3365 2.887187 4.226421
4 111011-None-None 2.887187 4.226421
... ... ... ...
61485 863099-CAN.8_1-9 1.235802 0.448933
61486 863099-CAN.8_1-None 1.235802 0.373875
61487 863099-None-222 1.235802 0.327702
61488 863099-None-9 1.235802 0.421377
61489 863099-None-None 1.235802 0.327702
[61490 rows x 3 columns]
string_id gtd_score w_awr_def_tot_score
0 111014-EGY.2_1-1732 1.028479 4.017506
1 111015-EGY.10_1-1732 1.028479 4.121294
2 111015-EGY.2_1-1732 1.028479 4.121294
3 111015-EGY.2_1-1775 1.162628 4.131568
4 111015-None-1775 1.162628 4.173954
... ... ... ...
8261 None-USA.44_1-1400 1.602974 1.523731
8262 None-USA.44_1-1722 1.188728 1.097968
8263 None-USA.47_1-1400 1.602974 1.523731
8264 None-USA.8_1-1400 1.602974 1.523731
8265 None-VNM.25_1-2122 2.165285 4.168952
[8266 rows x 3 columns]
string_id rfr_score w_awr_def_tot_score
0 111011-EGY.11_1-3365 4.180674 4.167781
1 111011-EGY.15_1-3365 4.180674 4.167781
2 111011-EGY.15_1-None 4.180674 4.167781
3 111011-None-3365 4.180674 4.226421
4 111011-None-None 4.180674 4.226421
... ... ... ...
63340 914900-None-17 0.000000 4.051755
63341 914900-None-19 0.000000 4.051755
63342 914900-None-21 0.000000 4.051755
63343 914900-None-26 0.000000 4.051755
63344 914900-None-None 0.000000 4.051755
[63345 rows x 3 columns]
string_id cfr_score w_awr_def_tot_score
0 111011-EGY.11_1-3365 0.0 4.167781
1 111011-EGY.15_1-3365 0.0 4.167781
2 111011-EGY.15_1-None 0.0 4.167781
3 111011-None-3365 0.0 4.226421
4 111011-None-None 0.0 4.226421
... ... ... ...
63340 914900-None-17 0.0 4.051755
63341 914900-None-19 0.0 4.051755
63342 914900-None-21 0.0 4.051755
63343 914900-None-26 0.0 4.051755
63344 914900-None-None 0.0 4.051755
[63345 rows x 3 columns]
string_id drr_score w_awr_def_tot_score
0 111081-ERI.2_1-3365 2.215140 3.950464
1 111081-ERI.6_1-3365 2.215140 3.950464
2 111081-None-None 2.215140 3.382882
3 111081-SDN.11_1-1775 2.215140 3.498231
4 111081-SDN.11_1-1930 2.215140 3.914832
... ... ... ...
52602 832808-CAN.3_1-352 1.512279 0.403321
52603 832808-CAN.8_1-352 1.512279 0.403321
52604 832809-CAN.12_1-352 1.473108 0.399229
52605 832809-CAN.3_1-352 1.473108 0.399229
52606 832809-CAN.8_1-352 1.473108 0.399229
[52607 rows x 3 columns]
string_id ucw_score w_awr_def_tot_score
0 111011-EGY.11_1-3365 2.046333 4.167781
1 111011-EGY.15_1-3365 2.046333 4.167781
2 111011-EGY.15_1-None 2.046333 4.167781
3 111012-EGY.11_1-3365 2.046333 4.148627
4 111012-EGY.15_1-3365 2.046333 4.148627
... ... ... ...
57694 None-YEM.5_1-None 3.955055 4.605401
57695 None-ZAF.1_1-None 2.005333 2.652450
57696 None-ZAF.4_1-None 2.005333 2.652450
57697 None-ZAF.9_1-2940 2.005333 2.652450
57698 None-ZAF.9_1-None 2.005333 2.652450
[57699 rows x 3 columns]
string_id cep_score w_awr_def_tot_score
0 111011-EGY.11_1-3365 2.000000 4.167781
1 111011-EGY.15_1-3365 2.000000 4.167781
2 111011-EGY.15_1-None 2.000000 4.167781
3 111011-None-3365 2.000000 4.226421
4 111011-None-None 2.000000 4.226421
... ... ... ...
62081 914900-None-17 1.983534 4.051755
62082 914900-None-19 1.983534 4.051755
62083 914900-None-21 1.983534 4.051755
62084 914900-None-26 1.983534 4.051755
62085 914900-None-None 1.983534 4.051755
[62086 rows x 3 columns]
string_id udw_score w_awr_def_tot_score
0 111011-EGY.11_1-3365 0.0 4.167781
1 111011-EGY.15_1-3365 0.0 4.167781
2 111011-EGY.15_1-None 0.0 4.167781
3 111011-None-3365 0.0 4.226421
4 111011-None-None 0.0 4.226421
... ... ... ...
63342 914900-None-17 0.0 4.051755
63343 914900-None-19 0.0 4.051755
63344 914900-None-21 0.0 4.051755
63345 914900-None-26 0.0 4.051755
63346 914900-None-None 0.0 4.051755
[63347 rows x 3 columns]
string_id usa_score w_awr_def_tot_score
0 111011-EGY.11_1-3365 0.890711 4.167781
1 111011-EGY.15_1-3365 0.890711 4.167781
2 111011-EGY.15_1-None 0.890711 4.167781
3 111011-None-3365 0.890711 4.226421
4 111011-None-None 0.890711 4.226421
... ... ... ...
63342 914900-None-17 0.000000 4.051755
63343 914900-None-19 0.000000 4.051755
63344 914900-None-21 0.000000 4.051755
63345 914900-None-26 0.000000 4.051755
63346 914900-None-None 0.000000 4.051755
[63347 rows x 3 columns]
string_id rri_score w_awr_def_tot_score
0 111011-EGY.11_1-3365 2.80 4.167781
1 111011-EGY.15_1-3365 2.80 4.167781
2 111011-EGY.15_1-None 2.80 4.167781
3 111012-EGY.11_1-3365 2.80 4.148627
4 111012-EGY.15_1-3365 2.80 4.148627
... ... ... ...
57885 None-YEM.5_1-None 4.60 4.605401
57886 None-ZAF.1_1-None 1.64 2.652450
57887 None-ZAF.4_1-None 1.64 2.652450
57888 None-ZAF.9_1-2940 1.64 2.652450
57889 None-ZAF.9_1-None 1.64 2.652450
[57890 rows x 3 columns]
Plots of weighteed aggregated water risk (w_awr) against:
in the order above. Side by side comparison between Malaysian reservoirs and global reservoirs are made in the same figure.
Baseline water stress measures the ratio of total water withdrawals to available renewable surface and groundwater supplies.
Water withdrawals include domestic, industrial, irrigation, and livestock consumptive and nonconsumptive uses.
Available renewable water supplies include the impact of upstream consumptive water users and large dams on downstream water availability.
*Higher values indicate more competition among users.*
# Figure adjustments
plt.rcParams['figure.figsize'] = [12,8]
# Scatter plot generation
plt.scatter(df_world_bws_awr['bws_score'], df_world_bws_awr['w_awr_def_tot_score'], color = 'blue', marker = 'o',
label = 'Global Baseline Water Stress (Global-bws) vs Global Weighted Aggregated Water Risk (G-w_awr)')
plt.scatter(df_MY_bws_awr['bws_score'], df_MY_bws_awr['w_awr_def_tot_score'], color = 'red', marker = 'o',
label = 'Malaysian Baseline Water Stress (MY-bws) vs Malaysian Weighted Aggregated Water Risk (MY-w_awr)')
# Including Linear Regression
m1a,b1a = np.polyfit(df_world_bws_awr['bws_score'], df_world_bws_awr['w_awr_def_tot_score'],1)
plt.plot(df_world_bws_awr['bws_score'], m1a*df_world_bws_awr['bws_score'] + b1a, color = 'fuchsia')
m1b,b1b = np.polyfit(df_MY_bws_awr['bws_score'], df_MY_bws_awr['w_awr_def_tot_score'],1)
plt.plot(df_MY_bws_awr['bws_score'], m1b*df_MY_bws_awr['bws_score'] + b1b, color = 'darksalmon')
# Giving title and label names
plt.title('Weighted Aggregated Water Risk (w_awr) against Baseline Water Stress (BWS)', fontsize = 14)
plt.xlabel('BWS for Malaysia (Red) and Global (Blue) ', fontsize = 14)
plt.ylabel('Weighted Aggregated Water Risk', fontsize = 14)
plt.grid(True)
plt.legend()
plt.show()
Baseline water depletion measures the ratio of total water consumption to available renewable water supplies.
Total water consumption includes domestic, industrial, irrigation, and livestock consumptive uses. Available renewable water supplies include the impact of upstream consumptive water users and large dams on downstream water availability.
*Higher values indicate larger impact on the local water supply and decreased water availability for downstream users.*
Baseline water depletion is similar to baseline water stress; however, instead of looking at total water withdrawal (consumptive plus nonconsumptive), baseline water depletion is calculated using consumptive withdrawal only.
# Figure adjustments
plt.rcParams['figure.figsize'] = [12,8]
# Scatter plot generation
plt.scatter(df_world_bwd_awr['bwd_score'], df_world_bwd_awr['w_awr_def_tot_score'], color = 'blue', marker = 'o',
label = 'Global Baseline Water Depletion (Global-bwd) vs Global Weighted Aggregated Water Risk (G-w_awr)')
plt.scatter(df_MY_bwd_awr['bwd_score'], df_MY_bwd_awr['w_awr_def_tot_score'], color = 'red', marker = 'o',
label = 'Malaysian Baseline Water Depletion (MY-bwd) vs Malaysian Weighted Aggregated Water Risk (MY-w_awr)')
# Including Linear Regression
m2a,b2a = np.polyfit(df_world_bwd_awr['bwd_score'], df_world_bwd_awr['w_awr_def_tot_score'],1)
plt.plot(df_world_bwd_awr['bwd_score'], m2a*df_world_bwd_awr['bwd_score'] + b2a, color = 'fuchsia')
m2b,b2b = np.polyfit(df_MY_bwd_awr['bwd_score'], df_MY_bwd_awr['w_awr_def_tot_score'],1)
plt.plot(df_MY_bwd_awr['bwd_score'], m2b*df_MY_bwd_awr['bwd_score'] + b2b, color = 'darksalmon')
# Giving title and label names
plt.title('Weighted Aggregated Water Risk (w_awr) against Baseline Water Depletion (BWD)', fontsize = 14)
plt.xlabel('BWD for Malaysia (Red) and Global (Blue) ', fontsize = 14)
plt.ylabel('Weighted Aggregated Water Risk', fontsize = 14)
plt.grid(True)
plt.legend()
plt.show()
Interannual variability measures the average between year variability of available water supply, including both renewable surface and groundwater supplies.
*Higher values indicate wider variations in available supply from year to year.*
# Figure adjustments
plt.rcParams['figure.figsize'] = [12,8]
# Scatter plot generation
plt.scatter(df_world_iav_awr['iav_score'], df_world_iav_awr['w_awr_def_tot_score'], color = 'blue', marker = 'o',
label = 'Global Interannual Variability (Global-iav) vs Global Weighted Aggregated Water Risk (G-w_awr)')
plt.scatter(df_MY_iav_awr['iav_score'], df_MY_iav_awr['w_awr_def_tot_score'], color = 'red', marker = 'o',
label = 'Malaysian Interannual Variability (MY-iav) vs Malaysian Weighted Aggregated Water Risk (MY-w_awr)')
# Including Linear Regression
m3a,b3a = np.polyfit(df_world_iav_awr['iav_score'], df_world_iav_awr['w_awr_def_tot_score'],1)
plt.plot(df_world_iav_awr['iav_score'], m3a*df_world_iav_awr['iav_score'] + b3a, color = 'fuchsia')
m3b,b3b = np.polyfit(df_MY_iav_awr['iav_score'], df_MY_iav_awr['w_awr_def_tot_score'],1)
plt.plot(df_MY_iav_awr['iav_score'], m3b*df_MY_iav_awr['iav_score'] + b3b, color = 'darksalmon')
# Giving title and label names
plt.title('Weighted Aggregated Water Risk (w_awr) against Interannual Variability (IAV)', fontsize = 14)
plt.xlabel('IAV for Malaysia (Red) and Global (Blue) ', fontsize = 14)
plt.ylabel('Weighted Aggregated Water Risk', fontsize = 14)
plt.grid(True)
plt.legend()
plt.show()
Seasonal variability measures the average within-year variability of available water supply, including both renewable surface and groundwater supplies.
*Higher values indicate wider variations of available supply within a year.*
# Figure adjustments
plt.rcParams['figure.figsize'] = [12,8]
# Scatter plot generation
plt.scatter(df_world_sev_awr['sev_score'], df_world_sev_awr['w_awr_def_tot_score'], color = 'blue', marker = 'o',
label = 'Global Seasonal Variability (Global-sev) vs Global Weighted Aggregated Water Risk (G-w_awr)')
plt.scatter(df_MY_sev_awr['sev_score'], df_MY_sev_awr['w_awr_def_tot_score'], color = 'red', marker = 'o',
label = 'Malaysian Seasonal Variability (MY-sev) vs Malaysian Weighted Aggregated Water Risk (MY-w_awr)')
# Including Linear Regression
m4a,b4a = np.polyfit(df_world_sev_awr['sev_score'], df_world_sev_awr['w_awr_def_tot_score'],1)
plt.plot(df_world_sev_awr['sev_score'], m4a*df_world_sev_awr['sev_score'] + b4a, color = 'fuchsia')
m4b,b4b = np.polyfit(df_MY_sev_awr['sev_score'], df_MY_sev_awr['w_awr_def_tot_score'],1)
plt.plot(df_MY_sev_awr['sev_score'], m4b*df_MY_sev_awr['sev_score'] + b4b, color = 'darksalmon')
# Giving title and label names
plt.title('Weighted Aggregated Water Risk (w_awr) against Seasonal Variability (SEV)', fontsize = 14)
plt.xlabel('SEV for Malaysia (Red) and Global (Blue) ', fontsize = 14)
plt.ylabel('Weighted Aggregated Water Risk', fontsize = 14)
plt.grid(True)
plt.legend()
plt.show()
Groundwater table decline measures the average decline of the groundwater table as the average change for the period of study (1990–2014). The result is expressed in centimeters per year (cm/yr).
*Higher values indicate higher levels of unsustainable groundwater withdrawals.*
Apparently, there is no recorded data for Malaysian reservoir. This makes determination of GTD severity in Malaysia indeterminate. Linear regression cannot be derived for Malaysian GTD scores.
# Figure adjustments
plt.rcParams['figure.figsize'] = [12,8]
# Scatter plot generation
plt.scatter(df_world_gtd_awr['gtd_score'], df_world_gtd_awr['w_awr_def_tot_score'], color = 'blue', marker = 'o',
label = 'Global Groundwater Table Decline (Global-gtd) vs Global Weighted Aggregated Water Risk (G-w_awr)')
plt.scatter(df_MY_gtd_awr['gtd_score'], df_MY_gtd_awr['w_awr_def_tot_score'], color = 'red', marker = 'o',
label = 'Malaysian Groundwater Table Decline (MY-gtd) vs Malaysian Weighted Aggregated Water Risk (MY-w_awr)')
# Including Linear Regression
m5a,b5a = np.polyfit(df_world_gtd_awr['gtd_score'], df_world_gtd_awr['w_awr_def_tot_score'],1)
plt.plot(df_world_gtd_awr['gtd_score'], m6a*df_world_gtd_awr['gtd_score'] + b5a, color = 'fuchsia')
# m5b,b5b = np.polyfit(df_MY_gtd_awr['gtd_score'], df_MY_gtd_awr['w_awr_def_tot_score'],1)
# plt.plot(df_MY_gtd_awr['gtd_score'], m5b*df_MY_gtd_awr['gtd_score'] + b5b, color = 'darksalmon')
# Giving title and label names
plt.title('Weighted Aggregated Water Risk (w_awr) against Groundwater Table Decline (GTD)', fontsize = 14)
plt.xlabel('GTD for Malaysia (Red) and Global (Blue) ', fontsize = 14)
plt.ylabel('Weighted Aggregated Water Risk', fontsize = 14)
plt.grid(True)
plt.legend()
plt.show()
Riverine flood risk measures the percentage of population expected to be affected by riverine flooding in an average year, accounting for existing flood-protection standards. Flood risk is assessed using hazard (inundation caused by river overflow), exposure (population in flood zone), and vulnerability.16 The existing level of flood protection is also incorporated into the risk calculation.
It is important to note that this indicator represents flood risk not in terms of maximum possible impact but rather as average annual impact. The impacts from infrequent, extreme flood years are averaged with more common, less newsworthy flood years to produce the “expected annual affected population.”
*Higher values indicate that a greater proportion of the population is expected to be impacted by riverine floods on average.*
# Figure adjustments
plt.rcParams['figure.figsize'] = [12,8]
# Scatter plot generation
plt.scatter(df_world_rfr_awr['rfr_score'], df_world_rfr_awr['w_awr_def_tot_score'], color = 'blue', marker = 'o',
label = 'Global Riverine Flood Risk (Global-rfr) vs Global Weighted Aggregated Water Risk (G-w_awr)')
plt.scatter(df_MY_rfr_awr['rfr_score'], df_MY_rfr_awr['w_awr_def_tot_score'], color = 'red', marker = 'o',
label = 'Malaysian Riverine Flood Risk (MY-rfr) vs Malaysian Weighted Aggregated Water Risk (MY-w_awr)')
# Including Linear Regression
m6a,b6a = np.polyfit(df_world_rfr_awr['rfr_score'], df_world_rfr_awr['w_awr_def_tot_score'],1)
plt.plot(df_world_rfr_awr['rfr_score'], m6a*df_world_rfr_awr['rfr_score'] + b6a, color = 'fuchsia')
m6b,b6b = np.polyfit(df_MY_rfr_awr['rfr_score'], df_MY_rfr_awr['w_awr_def_tot_score'],1)
plt.plot(df_MY_rfr_awr['rfr_score'], m6b*df_MY_rfr_awr['rfr_score'] + b6b, color = 'darksalmon')
# Giving title and label names
plt.title('Weighted Aggregated Water Risk (w_awr) against Riverine Flood Risk (RFR)', fontsize = 14)
plt.xlabel('RFR for Malaysia (Red) and Global (Blue) ', fontsize = 14)
plt.ylabel('Weighted Aggregated Water Risk', fontsize = 14)
plt.grid(True)
plt.legend()
plt.show()
Coastal flood risk measures the percentage of the population expected to be affected by coastal flooding in an average year, accounting for existing flood protection standards. Flood risk is assessed using hazard (inundation caused by storm surge), exposure (population in flood zone), and vulnerability.17 The existing level of flood protection is also incorporated into the risk calculation.
It is important to note that this indicator represents flood risk not in terms of maximum possible impact but rather as average annual impact. The impacts from infrequent, extreme flood years are averaged with more common, less newsworthy flood years to produce the “expected annual affected population.”
*Higher values indicate that a greater proportion of the population is expected to be impacted by coastal floods on average.*
# Figure adjustments
plt.rcParams['figure.figsize'] = [12,8]
# Scatter plot generation
plt.scatter(df_world_cfr_awr['cfr_score'], df_world_cfr_awr['w_awr_def_tot_score'], color = 'blue', marker = 'o',
label = 'Global Coastal Flood Risk (Global-cfr) vs Global Weighted Aggregated Water Risk (G-w_awr)')
plt.scatter(df_MY_cfr_awr['cfr_score'], df_MY_cfr_awr['w_awr_def_tot_score'], color = 'red', marker = 'o',
label = 'Malaysian Coastal Flood Risk (MY-cfr) vs Malaysian Weighted Aggregated Water Risk (MY-w_awr)')
# Including Linear Regression
m7a,b7a = np.polyfit(df_world_cfr_awr['cfr_score'], df_world_cfr_awr['w_awr_def_tot_score'],1)
plt.plot(df_world_cfr_awr['cfr_score'], m7a*df_world_cfr_awr['cfr_score'] + b7a, color = 'fuchsia')
m7b,b7b = np.polyfit(df_MY_cfr_awr['cfr_score'], df_MY_cfr_awr['w_awr_def_tot_score'],1)
plt.plot(df_MY_cfr_awr['cfr_score'], m7b*df_MY_cfr_awr['cfr_score'] + b7b, color = 'darksalmon')
# Giving title and label names
plt.title('Weighted Aggregated Water Risk (w_awr) against Coastal Flood Risk (CFR)', fontsize = 14)
plt.xlabel('CFR for Malaysia (Red) and Global (Blue) ', fontsize = 14)
plt.ylabel('Weighted Aggregated Water Risk', fontsize = 14)
plt.grid(True)
plt.legend()
plt.show()
Drought risk measures where droughts are likely to occur, the population and assets exposed, and the vulnerability of the population and assets to adverse effects.
*Higher values indicate higher risk of drought.*
# Figure adjustments
plt.rcParams['figure.figsize'] = [12,8]
# Scatter plot generation
plt.scatter(df_world_drr_awr['drr_score'], df_world_drr_awr['w_awr_def_tot_score'], color = 'blue', marker = 'o',
label = 'Global Drought Risk (Global-drr) vs Global Weighted Aggregated Water Risk (G-w_awr)')
plt.scatter(df_MY_drr_awr['drr_score'], df_MY_drr_awr['w_awr_def_tot_score'], color = 'red', marker = 'o',
label = 'Malaysian Drought Risk (MY-drr) vs Malaysian Weighted Aggregated Water Risk (MY-w_awr)')
# Including Linear Regression
m8a,b8a = np.polyfit(df_world_drr_awr['drr_score'], df_world_drr_awr['w_awr_def_tot_score'],1)
plt.plot(df_world_drr_awr['drr_score'], m8a*df_world_drr_awr['drr_score'] + b8a, color = 'fuchsia')
m8b,b8b = np.polyfit(df_MY_drr_awr['drr_score'], df_MY_drr_awr['w_awr_def_tot_score'],1)
plt.plot(df_MY_drr_awr['drr_score'], m8b*df_MY_drr_awr['drr_score'] + b8b, color = 'darksalmon')
# Giving title and label names
plt.title('Weighted Aggregated Water Risk (w_awr) against Drought Risk (DRR)', fontsize = 14)
plt.xlabel('DRR for Malaysia (Red) and Global (Blue) ', fontsize = 14)
plt.ylabel('Weighted Aggregated Water Risk', fontsize = 14)
plt.grid(True)
plt.legend()
plt.show()
Untreated connected wastewater measures the percentage of domestic wastewater that is connected through a sewerage system and not treated to at least a primary treatment level. Wastewater discharge without adequate treatment could expose water bodies, the general public, and ecosystems to pollutants such as pathogens and nutrients.
The indicator compounds two crucial elements of wastewater management: connection and treatment. Low connection rates reflect households’ lack of access to public sewerage systems; the absence of at least primary treatment reflects a country’s lack of capacity (infrastructure, institutional knowledge) to treat wastewater.
Together these factors can indicate the level of a country’s current capacity to manage its domestic wastewater through two main pathways: extremely low connection rates (below 1 percent), and high connection rates with little treatment.
*Higher values indicate higher percentages of point source wastewater discharged without treatment.*
# Figure adjustments
plt.rcParams['figure.figsize'] = [12,8]
# Scatter plot generation
plt.scatter(df_world_ucw_awr['ucw_score'], df_world_ucw_awr['w_awr_def_tot_score'], color = 'blue', marker = 'o',
label = 'Global Untreated Connected Wastewater (Global-ucw) vs Global Weighted Aggregated Water Risk (G-w_awr)')
plt.scatter(df_MY_ucw_awr['ucw_score'], df_MY_ucw_awr['w_awr_def_tot_score'], color = 'red', marker = 'o',
label = 'Malaysian Untreated Connected Wastewater (MY-ucw) vs Malaysian Weighted Aggregated Water Risk (MY-w_awr)')
# Including Linear Regression
m9a,b9a = np.polyfit(df_world_ucw_awr['ucw_score'], df_world_ucw_awr['w_awr_def_tot_score'],1)
plt.plot(df_world_ucw_awr['ucw_score'], m9a*df_world_ucw_awr['ucw_score'] + b9a, color = 'fuchsia')
# m9b,b9b = np.polyfit(df_MY_ucw_awr['ucw_score'], df_MY_ucw_awr['w_awr_def_tot_score'],1)
# plt.plot(df_MY_ucw_awr['ucw_score'], m9b*df_MY_ucw_awr['ucw_score'] + b9b, color = 'darksalmon')
# Giving title and label names
plt.title('Weighted Aggregated Water Risk (w_awr) against Untreated Connected Wastewater (UCW)', fontsize = 14)
plt.xlabel('UCW for Malaysia (Red) and Global (Blue) ', fontsize = 14)
plt.ylabel('Weighted Aggregated Water Risk', fontsize = 14)
plt.grid(True)
plt.legend()
plt.show()
Note that linear regression for Malaysian reservoir is not included due to non-suitable scatter pattern
Coastal eutrophication potential (CEP) measures the potential for riverine loadings of nitrogen (N), phosphorus (P), and silica (Si) to stimulate harmful algal blooms in coastal waters.
The CEP indicator is a useful metric to map where anthropogenic activities produce enough point-source and nonpoint-source pollution to potentially degrade the environment. When N and P are discharged in excess over Si with respect to diatoms, a major type of algae, undesirable algal species often develop. The stimulation of algae leading to large blooms may in turn result in eutrophication and hypoxia (excessive biological growth and decomposition that reduces oxygen available to other organisms).
It is therefore possible to assess the potential for coastal eutrophication from a river’s N, P, and Si loading.
*Higher values indicate higher levels of excess nutrients with respect to silica, creating more favorable conditions for harmful algal growth and eutrophication in coastal waters downstream.*
# Figure adjustments
plt.rcParams['figure.figsize'] = [12,8]
# Scatter plot generation
plt.scatter(df_world_cep_awr['cep_score'], df_world_cep_awr['w_awr_def_tot_score'], color = 'blue', marker = 'o',
label = 'Global Coastal Eutrophication Potential (Global-cep) vs Global Weighted Aggregated Water Risk (G-w_awr)')
plt.scatter(df_MY_cep_awr['cep_score'], df_MY_cep_awr['w_awr_def_tot_score'], color = 'red', marker = 'o',
label = 'Malaysian Coastal Eutrophication Potential (MY-cep) vs Malaysian Weighted Aggregated Water Risk (MY-w_awr)')
# Including Linear Regression
m10a,b10a = np.polyfit(df_world_cep_awr['cep_score'], df_world_cep_awr['w_awr_def_tot_score'],1)
plt.plot(df_world_cep_awr['cep_score'], m10a*df_world_cep_awr['cep_score'] + b10a, color = 'fuchsia')
m10b,b10b = np.polyfit(df_MY_cep_awr['cep_score'], df_MY_cep_awr['w_awr_def_tot_score'],1)
plt.plot(df_MY_cep_awr['cep_score'], m10b*df_MY_cep_awr['cep_score'] + b10b, color = 'darksalmon')
# Giving title and label names
plt.title('Weighted Aggregated Water Risk (w_awr) against Coastal Eutrophication Potential (CEP)', fontsize = 14)
plt.xlabel('CEP for Malaysia (Red) and Global (Blue) ', fontsize = 14)
plt.ylabel('Weighted Aggregated Water Risk', fontsize = 14)
plt.grid(True)
plt.legend()
plt.show()
Unimproved/no drinking water reflects the *percentage of the population collecting drinking water from an unprotected dug well or spring, or directly from a river, dam, lake, pond, stream, canal, or irrigation canal (WHO and UNICEF 2017).
Specifically, the indicator aligns with the unimproved and surface water categories of the Joint Monitoring Programme (JMP)—the lowest tiers of drinking water services.
*Higher values indicate areas where people have less access to safe drinking water supplies.*
# Figure adjustments
plt.rcParams['figure.figsize'] = [12,8]
# Scatter plot generation
plt.scatter(df_world_udw_awr['udw_score'], df_world_udw_awr['w_awr_def_tot_score'], color = 'blue', marker = 'o',
label = 'Global Unimproved/No Drinking Water (Global-udw) vs Global Weighted Aggregated Water Risk (G-w_awr)')
plt.scatter(df_MY_udw_awr['udw_score'], df_MY_udw_awr['w_awr_def_tot_score'], color = 'red', marker = 'o',
label = 'Malaysian Unimproved/No Drinking Water (MY-udw) vs Malaysian Weighted Aggregated Water Risk (MY-w_awr)')
# Including Linear Regression
m11a,b11a = np.polyfit(df_world_udw_awr['udw_score'], df_world_udw_awr['w_awr_def_tot_score'],1)
plt.plot(df_world_udw_awr['udw_score'], m11a*df_world_udw_awr['udw_score'] + b11a, color = 'fuchsia')
m11b,b11b = np.polyfit(df_MY_udw_awr['udw_score'], df_MY_udw_awr['w_awr_def_tot_score'],1)
plt.plot(df_MY_udw_awr['udw_score'], m11b*df_MY_udw_awr['udw_score'] + b11b, color = 'darksalmon')
# Giving title and label names
plt.title('Weighted Aggregated Water Risk (w_awr) against Unimproved/No Drinking Water (UDW)', fontsize = 14)
plt.xlabel('UDW for Malaysia (Red) and Global (Blue) ', fontsize = 14)
plt.ylabel('Weighted Aggregated Water Risk', fontsize = 14)
plt.grid(True)
plt.legend()
plt.show()
Unimproved/no sanitation reflects the percentage of the population using pit latrines without a slab or platform, hanging/bucket latrines, or directly disposing human waste in fields, forests, bushes, open bodies of water, beaches, other open spaces, or with solid waste (WHO and UNICEF 2017).
Specifically, the indicator aligns with JMP’s unimproved and open defecation categories— the lowest tier of sanitation services.
*Higher values indicate areas where people have less access to improved sanitation services.*
# Figure adjustments
plt.rcParams['figure.figsize'] = [12,8]
# Scatter plot generation
plt.scatter(df_world_usa_awr['usa_score'], df_world_usa_awr['w_awr_def_tot_score'], color = 'blue', marker = 'o',
label = 'Global Unimproved/No Drinking Sanitation (Global-usa) vs Global Weighted Aggregated Water Risk (G-w_awr)')
plt.scatter(df_MY_usa_awr['usa_score'], df_MY_usa_awr['w_awr_def_tot_score'], color = 'red', marker = 'o',
label = 'Malaysian Unimproved/No Drinking Sanitation (MY-usa) vs Malaysian Weighted Aggregated Water Risk (MY-w_awr)')
# Including Linear Regression
m12a,b12a = np.polyfit(df_world_usa_awr['usa_score'], df_world_usa_awr['w_awr_def_tot_score'],1)
plt.plot(df_world_usa_awr['usa_score'], m12*df_world_usa_awr['usa_score'] + b12a, color = 'fuchsia')
m12b,b12b = np.polyfit(df_MY_usa_awr['usa_score'], df_MY_usa_awr['w_awr_def_tot_score'],1)
plt.plot(df_MY_usa_awr['usa_score'], m12b*df_MY_usa_awr['usa_score'] + b12b, color = 'darksalmon')
# Giving title and label names
plt.title('Weighted Aggregated Water Risk (w_awr) against Unimproved/No Drinking Sanitation (USA)', fontsize = 14)
plt.xlabel('USA for Malaysia (Red) and Global (Blue) ', fontsize = 14)
plt.ylabel('Weighted Aggregated Water Risk', fontsize = 14)
plt.grid(True)
plt.legend()
plt.show()
The Peak RepRisk country ESG risk index quantifies business conduct risk exposure related to environmental, social, and governance (ESG) issues in the corresponding country. The index provides insights into potential financial, reputational, and compliance risks, such as human rights violations and environmental destruction.
RepRisk is a leading business intelligence provider that specializes in ESG and business conduct risk research for companies, projects, sectors, countries, ESG issues, NGOs, and more, by leveraging artificial intelligence and human analysis in 20 languages. WRI has elected to include the Peak RepRisk country ESG risk index in Aqueduct to reflect the broader regulatory and reputational risks that may threaten water quantity, quality, and access.
While the underlying algorithm is proprietary, we believe that our inclusion of the Peak RepRisk country ESG risk index, normally unavailable to the public, is a value-add to the Aqueduct community.
The peak value equals the highest level of the index in a given country over the last two years. *The higher the value, the higher the risk exposure.*
# Figure adjustments
plt.rcParams['figure.figsize'] = [12,8]
# Scatter plot generation
plt.scatter(df_world_rri_awr['rri_score'], df_world_rri_awr['w_awr_def_tot_score'], color = 'blue', marker = 'o',
label = 'Global Peak RepRisk Country ESG Risk Index Water (Global-rri) vs Global Weighted Aggregated Water Risk (G-w_awr)')
plt.scatter(df_MY_rri_awr['rri_score'], df_MY_rri_awr['w_awr_def_tot_score'], color = 'red', marker = 'o',
label = 'Malaysian Peak RepRisk Country ESG Risk Index (MY-udw) vs Malaysian Weighted Aggregated Water Risk (MY-w_awr)')
# Including Linear Regression
m13,b13 = np.polyfit(df_world_rri_awr['rri_score'], df_world_rri_awr['w_awr_def_tot_score'],1)
plt.plot(df_world_rri_awr['rri_score'], m13*df_world_rri_awr['rri_score'] + b13, color = 'fuchsia')
#sns.regplot(df_world_rri_awr['rri_score'], df_world_rri_awr['w_awr_def_tot_score'], ci=95)
# Giving title and label names
plt.title('Weighted Aggregated Water Risk (w_awr) against Peak RepRisk Country ESG Risk Index (RRI)', fontsize = 14)
plt.xlabel('RRI for Malaysia (Red) and Global (Blue) ', fontsize = 14)
plt.ylabel('Weighted Aggregated Water Risk', fontsize = 14)
plt.grid(True)
plt.legend()
plt.show()
Note that the regression for Malaysian reservoirs is not available due to non-suitable pattern.
By using box plots conjured using boxplot() function under matplotlib.pyplot library. We would like to see the outliers of recorded risk for both global and Malaysian water reservoirs. Perhaps by understanding how much outliers we have can give us more insight on how useful the sampled dataset is in generalising the water risk condition for scale at both global and Malaysian levels.
dataframes1 = [df_MY_bws_awr, df_MY_bwd_awr]
dataframes2 = [df_MY_iav_awr, df_MY_sev_awr]
suffix1 = '_1','_2'
df_box_1 = reduce(lambda left, right: pd.merge(left, right, on='string_id', suffixes=suffix1), dataframes1)
print(df_box_1)
df_box_1 = df_box_1.drop(['w_awr_def_tot_score_1', 'w_awr_def_tot_score_2'], axis = 1)
print(df_box_1)
df_box_2 = reduce(lambda left, right: pd.merge(left, right, on='string_id', suffixes=suffix1), dataframes2)
print(df_box_2)
df_box_2 = df_box_2.drop(['w_awr_def_tot_score_1', 'w_awr_def_tot_score_2'], axis = 1)
print(df_box_2)
pd_box_bws_bwd_iav_sev = pd.merge(df_box_1, df_box_2, on = 'string_id')
print(pd_box_bws_bwd_iav_sev)
string_id bws_score w_awr_def_tot_score_1 bwd_score \
0 444037-MYS.10_1-2171 0.0 1.446886 0.280569
1 444037-MYS.2_1-2171 0.0 1.446886 0.280569
2 444037-MYS.3_1-2171 0.0 1.446886 0.280569
3 444037-MYS.3_1-2217 0.0 1.446886 0.280569
4 444037-MYS.3_1-None 0.0 1.446886 0.280569
.. ... ... ... ...
216 521760-MYS.14_1-2222 0.0 2.943302 0.001385
217 521780-MYS.14_1-2222 0.0 2.344822 0.001329
218 521808-MYS.14_1-2222 0.0 2.288622 0.000664
219 521809-MYS.14_1-2222 0.0 2.439230 0.000547
220 524010-MYS.13_1-None 0.0 2.843826 0.463100
w_awr_def_tot_score_2
0 1.446886
1 1.446886
2 1.446886
3 1.446886
4 1.446886
.. ...
216 2.943302
217 2.344822
218 2.288622
219 2.439230
220 2.843826
[221 rows x 5 columns]
string_id bws_score bwd_score
0 444037-MYS.10_1-2171 0.0 0.280569
1 444037-MYS.2_1-2171 0.0 0.280569
2 444037-MYS.3_1-2171 0.0 0.280569
3 444037-MYS.3_1-2217 0.0 0.280569
4 444037-MYS.3_1-None 0.0 0.280569
.. ... ... ...
216 521760-MYS.14_1-2222 0.0 0.001385
217 521780-MYS.14_1-2222 0.0 0.001329
218 521808-MYS.14_1-2222 0.0 0.000664
219 521809-MYS.14_1-2222 0.0 0.000547
220 524010-MYS.13_1-None 0.0 0.463100
[221 rows x 3 columns]
string_id iav_score w_awr_def_tot_score_1 sev_score \
0 444037-MYS.10_1-2171 1.289226 1.446886 1.939940
1 444037-MYS.2_1-2171 1.289226 1.446886 1.939940
2 444037-MYS.3_1-2171 1.289226 1.446886 1.939940
3 444037-MYS.3_1-2217 1.289226 1.446886 1.939940
4 444037-MYS.3_1-None 1.289226 1.446886 1.939940
.. ... ... ... ...
216 521760-MYS.14_1-2222 1.262480 2.943302 0.306158
217 521780-MYS.14_1-2222 1.191458 2.344822 0.347437
218 521808-MYS.14_1-2222 0.924765 2.288622 0.635348
219 521809-MYS.14_1-2222 0.927093 2.439230 0.638737
220 524010-MYS.13_1-None 3.269535 2.843826 0.672260
w_awr_def_tot_score_2
0 1.446886
1 1.446886
2 1.446886
3 1.446886
4 1.446886
.. ...
216 2.943302
217 2.344822
218 2.288622
219 2.439230
220 2.843826
[221 rows x 5 columns]
string_id iav_score sev_score
0 444037-MYS.10_1-2171 1.289226 1.939940
1 444037-MYS.2_1-2171 1.289226 1.939940
2 444037-MYS.3_1-2171 1.289226 1.939940
3 444037-MYS.3_1-2217 1.289226 1.939940
4 444037-MYS.3_1-None 1.289226 1.939940
.. ... ... ...
216 521760-MYS.14_1-2222 1.262480 0.306158
217 521780-MYS.14_1-2222 1.191458 0.347437
218 521808-MYS.14_1-2222 0.924765 0.635348
219 521809-MYS.14_1-2222 0.927093 0.638737
220 524010-MYS.13_1-None 3.269535 0.672260
[221 rows x 3 columns]
string_id bws_score bwd_score iav_score sev_score
0 444037-MYS.10_1-2171 0.0 0.280569 1.289226 1.939940
1 444037-MYS.2_1-2171 0.0 0.280569 1.289226 1.939940
2 444037-MYS.3_1-2171 0.0 0.280569 1.289226 1.939940
3 444037-MYS.3_1-2217 0.0 0.280569 1.289226 1.939940
4 444037-MYS.3_1-None 0.0 0.280569 1.289226 1.939940
.. ... ... ... ... ...
216 521760-MYS.14_1-2222 0.0 0.001385 1.262480 0.306158
217 521780-MYS.14_1-2222 0.0 0.001329 1.191458 0.347437
218 521808-MYS.14_1-2222 0.0 0.000664 0.924765 0.635348
219 521809-MYS.14_1-2222 0.0 0.000547 0.927093 0.638737
220 524010-MYS.13_1-None 0.0 0.463100 3.269535 0.672260
[221 rows x 5 columns]
import mplcursors
import mpld3
from mpld3 import plugins
boxplot_1 = pd_box_bws_bwd_iav_sev.boxplot()
boxplot_1.set_title('Boxplot for BWS, BWD, IAV and SEV')
# create the zoom plugin
zoom = plugins.Zoom(button=True, enabled=True)
# add the zoom plugin to the plot
mpld3.plugins.connect(boxplot_1.figure, zoom)
# show the plot
mpld3.show()
Note: if you're in the IPython notebook, mpld3.show() is not the best command
to use. Consider using mpld3.display(), or mpld3.enable_notebook().
See more information at http://mpld3.github.io/quickstart.html .
You must interrupt the kernel to end this command
Serving to http://127.0.0.1:8889/ [Ctrl-C to exit]
127.0.0.1 - - [16/Apr/2023 16:50:57] "GET / HTTP/1.1" 200 - 127.0.0.1 - - [16/Apr/2023 16:50:57] "GET /d3.js HTTP/1.1" 200 - 127.0.0.1 - - [16/Apr/2023 16:50:57] "GET /mpld3.js HTTP/1.1" 200 -
stopping Server...
Note that data pertaining to GTD is unavailable for Malaysian reservoirs.
dataframes3 = [df_MY_rfr_awr, df_MY_cfr_awr]
dataframes4 = [df_MY_drr_awr, df_MY_ucw_awr]
suffix1 = '_1','_2'
df_box_3 = reduce(lambda left, right: pd.merge(left, right, on='string_id', suffixes=suffix1), dataframes3)
print(df_box_3)
df_box_3 = df_box_3.drop(['w_awr_def_tot_score_1', 'w_awr_def_tot_score_2'], axis = 1)
print(df_box_3)
df_box_4 = reduce(lambda left, right: pd.merge(left, right, on='string_id', suffixes=suffix1), dataframes4)
print(df_box_4)
df_box_4 = df_box_4.drop(['w_awr_def_tot_score_1', 'w_awr_def_tot_score_2'], axis = 1)
print(df_box_4)
pd_box_rfr_cfr_drr_ucw = pd.merge(df_box_3, df_box_4, on = 'string_id')
print(pd_box_rfr_cfr_drr_ucw)
string_id rfr_score w_awr_def_tot_score_1 cfr_score \
0 444037-MYS.10_1-2171 3.655798 1.446886 0.541537
1 444037-MYS.2_1-2171 3.655798 1.446886 0.541537
2 444037-MYS.3_1-2171 3.655798 1.446886 0.541537
3 444037-MYS.3_1-2217 3.655798 1.446886 0.541537
4 444037-MYS.3_1-None 3.655798 1.446886 0.541537
.. ... ... ... ...
216 521760-MYS.14_1-2222 4.141351 2.943302 4.017283
217 521780-MYS.14_1-2222 4.142712 2.344822 0.000000
218 521808-MYS.14_1-2222 1.562173 2.288622 0.000000
219 521809-MYS.14_1-2222 2.959756 2.439230 0.000000
220 524010-MYS.13_1-None 1.651099 2.843826 4.007097
w_awr_def_tot_score_2
0 1.446886
1 1.446886
2 1.446886
3 1.446886
4 1.446886
.. ...
216 2.943302
217 2.344822
218 2.288622
219 2.439230
220 2.843826
[221 rows x 5 columns]
string_id rfr_score cfr_score
0 444037-MYS.10_1-2171 3.655798 0.541537
1 444037-MYS.2_1-2171 3.655798 0.541537
2 444037-MYS.3_1-2171 3.655798 0.541537
3 444037-MYS.3_1-2217 3.655798 0.541537
4 444037-MYS.3_1-None 3.655798 0.541537
.. ... ... ...
216 521760-MYS.14_1-2222 4.141351 4.017283
217 521780-MYS.14_1-2222 4.142712 0.000000
218 521808-MYS.14_1-2222 1.562173 0.000000
219 521809-MYS.14_1-2222 2.959756 0.000000
220 524010-MYS.13_1-None 1.651099 4.007097
[221 rows x 3 columns]
string_id drr_score w_awr_def_tot_score_1 ucw_score \
0 444037-MYS.10_1-2171 2.857626 1.446886 5.0
1 444037-MYS.2_1-2171 2.857626 1.446886 5.0
2 444037-MYS.3_1-2171 2.857626 1.446886 5.0
3 444037-MYS.3_1-2217 2.857626 1.446886 5.0
4 444037-MYS.3_1-None 2.857626 1.446886 5.0
.. ... ... ... ...
216 521760-MYS.14_1-2222 2.744894 2.943302 5.0
217 521780-MYS.14_1-2222 2.637014 2.344822 5.0
218 521808-MYS.14_1-2222 2.540840 2.288622 5.0
219 521809-MYS.14_1-2222 2.598949 2.439230 5.0
220 524010-MYS.13_1-None 3.317998 2.843826 5.0
w_awr_def_tot_score_2
0 1.446886
1 1.446886
2 1.446886
3 1.446886
4 1.446886
.. ...
216 2.943302
217 2.344822
218 2.288622
219 2.439230
220 2.843826
[221 rows x 5 columns]
string_id drr_score ucw_score
0 444037-MYS.10_1-2171 2.857626 5.0
1 444037-MYS.2_1-2171 2.857626 5.0
2 444037-MYS.3_1-2171 2.857626 5.0
3 444037-MYS.3_1-2217 2.857626 5.0
4 444037-MYS.3_1-None 2.857626 5.0
.. ... ... ...
216 521760-MYS.14_1-2222 2.744894 5.0
217 521780-MYS.14_1-2222 2.637014 5.0
218 521808-MYS.14_1-2222 2.540840 5.0
219 521809-MYS.14_1-2222 2.598949 5.0
220 524010-MYS.13_1-None 3.317998 5.0
[221 rows x 3 columns]
string_id rfr_score cfr_score drr_score ucw_score
0 444037-MYS.10_1-2171 3.655798 0.541537 2.857626 5.0
1 444037-MYS.2_1-2171 3.655798 0.541537 2.857626 5.0
2 444037-MYS.3_1-2171 3.655798 0.541537 2.857626 5.0
3 444037-MYS.3_1-2217 3.655798 0.541537 2.857626 5.0
4 444037-MYS.3_1-None 3.655798 0.541537 2.857626 5.0
.. ... ... ... ... ...
216 521760-MYS.14_1-2222 4.141351 4.017283 2.744894 5.0
217 521780-MYS.14_1-2222 4.142712 0.000000 2.637014 5.0
218 521808-MYS.14_1-2222 1.562173 0.000000 2.540840 5.0
219 521809-MYS.14_1-2222 2.959756 0.000000 2.598949 5.0
220 524010-MYS.13_1-None 1.651099 4.007097 3.317998 5.0
[221 rows x 5 columns]
boxplot_2 = pd_box_rfr_cfr_drr_ucw.boxplot()
boxplot_2.set_title('Boxplot for RFR, CFR, DRR and UCW')
# create the zoom plugin
zoom = plugins.Zoom(button=True, enabled=True)
# add the zoom plugin to the plot
mpld3.plugins.connect(boxplot_2.figure, zoom)
# show the plot
mpld3.show()
Note: if you're in the IPython notebook, mpld3.show() is not the best command
to use. Consider using mpld3.display(), or mpld3.enable_notebook().
See more information at http://mpld3.github.io/quickstart.html .
You must interrupt the kernel to end this command
Serving to http://127.0.0.1:8889/ [Ctrl-C to exit]
127.0.0.1 - - [16/Apr/2023 17:01:56] "GET / HTTP/1.1" 200 - 127.0.0.1 - - [16/Apr/2023 17:01:56] "GET /d3.js HTTP/1.1" 200 - 127.0.0.1 - - [16/Apr/2023 17:01:56] "GET /mpld3.js HTTP/1.1" 200 -
stopping Server...
dataframes5 = [df_MY_cep_awr, df_MY_usa_awr]
suffix1 = '_1','_2'
df_box_5 = reduce(lambda left, right: pd.merge(left, right, on='string_id', suffixes=suffix1), dataframes5)
print(df_box_5)
df_box_5 = df_box_5.drop(['w_awr_def_tot_score_1', 'w_awr_def_tot_score_2'], axis = 1)
print(df_box_5)
pd_box_cep_usa_rri_awr = pd.merge(df_box_5, df_MY_rri_awr, on = 'string_id')
print(pd_box_cep_usa_rri_awr)
string_id cep_score w_awr_def_tot_score_1 usa_score \
0 444037-MYS.10_1-2171 1.122045 1.446886 0.000000
1 444037-MYS.2_1-2171 1.122045 1.446886 0.000000
2 444037-MYS.3_1-2171 1.122045 1.446886 0.000000
3 444037-MYS.3_1-2217 1.122045 1.446886 0.000000
4 444037-MYS.3_1-None 1.122045 1.446886 0.000000
.. ... ... ... ...
213 521760-MYS.13_1-2222 0.839217 2.943302 4.421503
214 521760-MYS.14_1-2222 0.839217 2.943302 4.421503
215 521780-MYS.14_1-2222 0.831462 2.344822 4.002440
216 521808-MYS.14_1-2222 0.860604 2.288622 4.514201
217 521809-MYS.14_1-2222 0.875725 2.439230 4.511235
w_awr_def_tot_score_2
0 1.446886
1 1.446886
2 1.446886
3 1.446886
4 1.446886
.. ...
213 2.943302
214 2.943302
215 2.344822
216 2.288622
217 2.439230
[218 rows x 5 columns]
string_id cep_score usa_score
0 444037-MYS.10_1-2171 1.122045 0.000000
1 444037-MYS.2_1-2171 1.122045 0.000000
2 444037-MYS.3_1-2171 1.122045 0.000000
3 444037-MYS.3_1-2217 1.122045 0.000000
4 444037-MYS.3_1-None 1.122045 0.000000
.. ... ... ...
213 521760-MYS.13_1-2222 0.839217 4.421503
214 521760-MYS.14_1-2222 0.839217 4.421503
215 521780-MYS.14_1-2222 0.831462 4.002440
216 521808-MYS.14_1-2222 0.860604 4.514201
217 521809-MYS.14_1-2222 0.875725 4.511235
[218 rows x 3 columns]
string_id cep_score usa_score rri_score \
0 444037-MYS.10_1-2171 1.122045 0.000000 1.96
1 444037-MYS.2_1-2171 1.122045 0.000000 1.96
2 444037-MYS.3_1-2171 1.122045 0.000000 1.96
3 444037-MYS.3_1-2217 1.122045 0.000000 1.96
4 444037-MYS.3_1-None 1.122045 0.000000 1.96
.. ... ... ... ...
213 521760-MYS.13_1-2222 0.839217 4.421503 1.96
214 521760-MYS.14_1-2222 0.839217 4.421503 1.96
215 521780-MYS.14_1-2222 0.831462 4.002440 1.96
216 521808-MYS.14_1-2222 0.860604 4.514201 1.96
217 521809-MYS.14_1-2222 0.875725 4.511235 1.96
w_awr_def_tot_score
0 1.446886
1 1.446886
2 1.446886
3 1.446886
4 1.446886
.. ...
213 2.943302
214 2.943302
215 2.344822
216 2.288622
217 2.439230
[218 rows x 5 columns]
boxplot_3 = pd_box_cep_usa_rri_awr.boxplot()
boxplot_3.set_title('Boxplot for CEP, USA, RRI and W_AWR')
# create the zoom plugin
zoom = plugins.Zoom(button=True, enabled=True)
# add the zoom plugin to the plot
mpld3.plugins.connect(boxplot_3.figure, zoom)
# show the plot
mpld3.show()
Note: if you're in the IPython notebook, mpld3.show() is not the best command
to use. Consider using mpld3.display(), or mpld3.enable_notebook().
See more information at http://mpld3.github.io/quickstart.html .
You must interrupt the kernel to end this command
Serving to http://127.0.0.1:8889/ [Ctrl-C to exit]
127.0.0.1 - - [16/Apr/2023 17:13:26] "GET / HTTP/1.1" 200 - 127.0.0.1 - - [16/Apr/2023 17:13:26] "GET /d3.js HTTP/1.1" 200 - 127.0.0.1 - - [16/Apr/2023 17:13:26] "GET /mpld3.js HTTP/1.1" 200 -
stopping Server...
For reference, the following table is used for risk categorization.
To observe the statistical distribution, we can analyse its normal distribution by using histogram. Quantitative determination of statistical evaluation of the dataset can be done by using describe() function for querying the following information:
# For reduced worldwide dataset
for df in new_clean_risk:
print(df.describe())
bws_score
count 62900.000000
mean 1.614017
std 1.928062
min 0.000000
25% 0.000000
50% 0.404393
75% 3.306653
max 5.000000
bwd_score
count 62900.000000
mean 1.368724
std 1.686260
min 0.000000
25% 0.019346
50% 0.592350
75% 2.167433
max 5.000000
iav_score
count 61490.000000
mean 1.971766
std 1.197009
min 0.089506
25% 1.123136
50% 1.597281
75% 2.410017
max 5.000000
sev_score
count 61490.000000
mean 2.007989
std 1.077765
min 0.086811
25% 1.160893
50% 1.879658
75% 2.682340
max 5.000000
gtd_score
count 8266.000000
mean 1.455618
std 0.954981
min 0.000000
25% 0.850489
50% 1.221637
75% 2.043840
max 5.000000
rfr_score
count 63345.000000
mean 2.135573
std 1.459553
min 0.000000
25% 0.792551
50% 2.188583
75% 3.514901
max 5.000000
cfr_score
count 63345.000000
mean 0.429704
std 1.048393
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 5.000000
drr_score
count 52607.000000
mean 2.288582
std 0.980051
min 0.004345
25% 1.651795
50% 2.385077
75% 3.019026
max 4.800975
ucw_score
count 57699.000000
mean 3.045088
std 1.633160
min 0.000000
25% 1.874760
50% 3.262335
75% 5.000000
max 5.000000
cep_score
count 62086.000000
mean 2.095437
std 0.871943
min 0.000000
25% 1.597785
50% 1.978650
75% 2.652383
max 5.000000
udw_score
count 63347.000000
mean 1.854942
std 1.823924
min 0.000000
25% 0.000000
50% 1.378512
75% 3.242261
max 5.000000
usa_score
count 63347.000000
mean 2.476417
std 2.197224
min 0.000000
25% 0.000000
50% 2.608720
75% 4.843105
max 5.000000
rri_score
count 57890.000000
mean 2.409447
std 1.275619
min 0.000000
25% 1.240000
50% 2.500000
75% 3.333333
max 5.000000
w_awr_def_tot_score
count 66004.000000
mean 2.465465
std 1.333274
min 0.000000
25% 1.316718
50% 2.507625
75% 3.635255
max 5.000000
# Using Malaysian reservoir scaled down dataset with corrected values
MY_partial_identifier = 'MYS'
df_reduced_MY_pre = df_reduced_basin[df_reduced_basin['string_id'].str.contains(MY_partial_identifier)]
print(df_reduced_MY_pre)
string_id bws_score bwd_score iav_score sev_score \
34099 444037-MYS.10_1-2171 0.0 0.280569 1.289226 1.93994
34100 444037-MYS.2_1-2171 0.0 0.280569 1.289226 1.93994
34101 444037-MYS.3_1-2171 0.0 0.280569 1.289226 1.93994
34102 444037-MYS.3_1-2217 0.0 0.280569 1.289226 1.93994
34103 444037-MYS.3_1-None 0.0 0.280569 1.289226 1.93994
... ... ... ... ... ...
64951 None-MYS.6_1-None NaN NaN NaN NaN
64952 None-MYS.8_1-2283 NaN NaN NaN NaN
64953 None-MYS.8_1-None NaN NaN NaN NaN
64954 None-MYS.9_1-2251 NaN NaN NaN NaN
64955 None-MYS.9_1-None NaN NaN NaN NaN
gtd_score rfr_score cfr_score drr_score ucw_score cep_score \
34099 NaN 3.655798 0.541537 2.857626 5.0 1.122045
34100 NaN 3.655798 0.541537 2.857626 5.0 1.122045
34101 NaN 3.655798 0.541537 2.857626 5.0 1.122045
34102 NaN 3.655798 0.541537 2.857626 5.0 1.122045
34103 NaN 3.655798 0.541537 2.857626 5.0 1.122045
... ... ... ... ... ... ...
64951 NaN NaN NaN NaN 5.0 NaN
64952 NaN NaN NaN NaN 5.0 NaN
64953 NaN NaN NaN NaN 5.0 NaN
64954 NaN NaN NaN NaN 5.0 NaN
64955 NaN NaN NaN NaN 5.0 NaN
udw_score usa_score rri_score w_awr_def_tot_score
34099 0.735103 0.0 1.96 1.446886
34100 0.735103 0.0 1.96 1.446886
34101 0.735103 0.0 1.96 1.446886
34102 0.735103 0.0 1.96 1.446886
34103 0.735103 0.0 1.96 1.446886
... ... ... ... ...
64951 NaN NaN 1.96 4.738070
64952 NaN NaN 1.96 4.738070
64953 NaN NaN 1.96 4.738070
64954 NaN NaN 1.96 4.738070
64955 NaN NaN 1.96 4.738070
[243 rows x 15 columns]
print(df_reduced_MY_pre.describe())
bws_score bwd_score iav_score sev_score gtd_score rfr_score \
count 221.000000 221.000000 221.000000 221.000000 0.0 221.000000
mean 0.088378 0.163181 1.245548 1.118349 NaN 2.849137
std 0.250046 0.202946 0.427030 0.479948 NaN 1.048977
min 0.000000 0.000202 0.769569 0.302665 NaN 0.349783
25% 0.000000 0.013720 0.969633 0.805814 NaN 1.879655
50% 0.000000 0.067504 1.184185 0.944390 NaN 3.068185
75% 0.000000 0.207984 1.405409 1.504838 NaN 3.976323
max 1.070376 0.843650 3.269535 2.399665 NaN 4.142712
cfr_score drr_score ucw_score cep_score udw_score usa_score \
count 221.000000 221.000000 243.0 218.000000 221.000000 221.000000
mean 1.597623 2.568107 5.0 1.422830 2.460366 0.383874
std 1.397664 0.216480 0.0 1.086979 0.849169 1.154883
min 0.000000 2.159738 5.0 0.711098 0.000000 0.000000
25% 0.009717 2.420870 5.0 0.872887 2.174635 0.000000
50% 1.391160 2.570406 5.0 0.947002 2.789204 0.000000
75% 2.648834 2.751034 5.0 1.087275 3.038372 0.000000
max 4.018591 3.317998 5.0 4.174733 3.818489 4.515601
rri_score w_awr_def_tot_score
count 2.430000e+02 243.000000
mean 1.960000e+00 2.025160
std 8.010104e-15 0.915227
min 1.960000e+00 1.158993
25% 1.960000e+00 1.541547
50% 1.960000e+00 1.685280
75% 1.960000e+00 1.948246
max 1.960000e+00 4.738070
For visualization of distribution curves, refer to scatter matrix section
We can use scatter_matrix() from package pandas.plotting package. It is found that one can obtain histogram or distribution curve by observing the diagonal component of scatter matrix output.
# Importing the necessary package
from pandas.plotting import scatter_matrix
# Testing with original set without modifications
scatter_matrix(df_reduced_basin, figsize=(15,15), diagonal = 'kde')
plt.show()
# Using Malaysian reservoir scaled down dataset with corrected values
MY_partial_identifier = 'MYS'
df_reduced_MY = df_reduced_basin[df_reduced_basin['string_id'].str.contains(MY_partial_identifier)]
df_reduced_MY = df_reduced_MY.drop(['gtd_score', 'ucw_score', 'cep_score', 'udw_score', 'usa_score', 'rri_score'], axis = 1)
print(df_reduced_MY)
scatter_matrix(df_reduced_MY, figsize=(15,15), diagonal = 'kde')
plt.show()
# Do note that columns are being drop for error handling so that we can circumvent errors due to some kind of error caused
# by non-invertable matrix generated by the original df_reduced_MY DataFrame
string_id bws_score bwd_score iav_score sev_score \
34099 444037-MYS.10_1-2171 0.0 0.280569 1.289226 1.93994
34100 444037-MYS.2_1-2171 0.0 0.280569 1.289226 1.93994
34101 444037-MYS.3_1-2171 0.0 0.280569 1.289226 1.93994
34102 444037-MYS.3_1-2217 0.0 0.280569 1.289226 1.93994
34103 444037-MYS.3_1-None 0.0 0.280569 1.289226 1.93994
... ... ... ... ... ...
64951 None-MYS.6_1-None NaN NaN NaN NaN
64952 None-MYS.8_1-2283 NaN NaN NaN NaN
64953 None-MYS.8_1-None NaN NaN NaN NaN
64954 None-MYS.9_1-2251 NaN NaN NaN NaN
64955 None-MYS.9_1-None NaN NaN NaN NaN
rfr_score cfr_score drr_score w_awr_def_tot_score
34099 3.655798 0.541537 2.857626 1.446886
34100 3.655798 0.541537 2.857626 1.446886
34101 3.655798 0.541537 2.857626 1.446886
34102 3.655798 0.541537 2.857626 1.446886
34103 3.655798 0.541537 2.857626 1.446886
... ... ... ... ...
64951 NaN NaN NaN 4.738070
64952 NaN NaN NaN 4.738070
64953 NaN NaN NaN 4.738070
64954 NaN NaN NaN 4.738070
64955 NaN NaN NaN 4.738070
[243 rows x 9 columns]
We can use corr() to investigate correlation of the variables. The function generates heatmap which represents the level of proportionality between variables (columnms) of a dataframe.
corrmat0 = df_basin.corr()
feature_ind0 = corrmat0.index
fig, ax = plt.subplots(figsize=(40,40))
# Plotting the correlation matrix as sns heat map for original global dataset
heatmap0 = sns.heatmap(df_basin[feature_ind0].corr(), annot = True, cmap = "RdYlGn")
# create the zoom plugin
zoom = plugins.Zoom(button=True, enabled=True)
# add the zoom plugin to the plot
mpld3.plugins.connect(heatmap0.figure, zoom)
# Display the plot with the plugins
mpld3.display()
corrmat1 = df_reduced_basin.corr()
feature_ind1 = corrmat1.index
plt.figure(figsize=(12,12))
# Plotting the correlation matrix as sns heat map for reduced global dataset
sns.heatmap(df_reduced_basin[feature_ind1].corr(), annot = True, cmap = "RdYlGn")
plt.show()
# For reduced Malaysian reservoir dataset
corrmat2 = df_reduced_MY_pre.corr()
feature_ind2 = corrmat2.index
plt.figure(figsize=(12,12))
# Plotting the correlation matrix as sns heat map for reduced global dataset
sns.heatmap(df_reduced_MY_pre[feature_ind2].corr(), annot = True, cmap = "RdYlGn")
plt.show()
Analysis of the EDA is made as following:
It was found that the size of the dataset is quite formidable since it has about 68.5k entries with 57 columns. To make the analysis worthwhile of Author's time, the studied dataset had been reduced from original dataset. The studied size had been shinked down to about 14 columns (15 including string_id).
Modularization of the dataset had proven very useful in analysis since, by using modularised dataset, the Author had been able to conduct scatter plotting and linear regression to determine relationship between weighted aggregated water risk (w_awr) with respect to 13 risk indicators.
The Author was able to demonstrate how Malaysian reservoirs readings and evaluations compare with global dataset. Linear regression had been made to observe the dependency of w_awr with each 13 indicators for both global and Malaysian dataset. Following are the breakdowns for each indicators.
a. Weighted Aggregated Water Risk (w_awr) vs Baseline Water Stress (bws)
It is demonstrated that BWS is linearly related to w_awr with positive relationship. It is seen that Malaysian reservoirs are recorded to have mostly minimum score on BWS with exception of few reservoirs. This tells us that this indicator is not of immediate concern. As for linear regression analysis, we can see that regression for Malaysian reservoirs is less steep than that of global dataset, indicating that BWS makes up for lesser proportion of w_awr for our national reservoirs.
b. Weighted Aggregated Water Risk (w_awr) vs Baseline Water Depletion (bwd)
Similar to BWS, BWD has linear positive relationship with w_awr for both global and Malaysian reservoirs. It is observed that this too is not of iommediate concern since BWD scores is not significant enough since all datapoint lie below 1. Similar to BWS, this indicator makes up less proportion for Malaysian w_awr scores since the slope is much lower than global regression.
c. Weighted Aggregated Water Risk (w_awr) vs Interannual Variability (iav)
IAV is observed to have positive linear relationship with w_awr for both global and Malaysian datasets. This indicator may be of interest since albeit of most reservoirs score on the lower end of the range, some reservoirs are actually above 2.5. This means that this indicator is within medium - high risk for Malaysian reservoirs. The Malaysian regression is still lower than global regression, indicating that this indicator affects w_awr score in milder manner to Malaysian reservoirs than it is to global dataset.
d. Weighted Aggregated Water Risk (w_awr) vs Seasonal Variability (sev)
SEV is observed to have positive linear relationship with w_awr for global dataset but negative linear relationship for Malaysian reservoirs. Might be caused by the fact that Malaysia is within equator line, making seasonal variability significantly lower as compared to other countries further away from the Equator. Though it must be noted that some of Malaysian reservoirs are very much still moderately affected by seasonal variability.
e. Weighted Aggregated Water Risk (w_awr) vs Groundwater Table Decline (gtd)
GTD is observed to have positive linear relationship with w_awr for global dataset but there is no measurement made with respect to GTD on Malaysian reservoirs.
f. Weighted Aggregated Water Risk (w_awr) vs Riverine Flood Risk (rfr)
RFR is observed to have positive linear relationship with w_awr for global dataset and Malaysian dataset. The slope for both global and local regression is at somewhat equal steepness though the y-intercept of Malaysian reservoirs regression appear to be at lower value. Very important note here is that there is quite significant amount of reservoirs scoring on the higher end of the spectrum with some exceeding 4, which is considered between high and very high on the range.
g. Weighted Aggregated Water Risk (w_awr) vs Coastal Flood Risk (cfr)
CFR is observed to have positive linear relationship with w_awr for Malaysian dataset but negative linear relationship for global dataset. This indicates that Malaysia is somewhat unique in this aspect compared to globa dataset. Though it must be mentioned that the score distributions is somewhat dense on both extreme ends of Malaysian sampled score.
h. Weighted Aggregated Water Risk (w_awr) vs Drought Risk (drr)
DRR is observed to have positive linear relationship with w_awr for global dataset and Malaysian dataset. It is important to note that Malaysian datapoints appear to have lower deviation from each other, scoring between medium to medium highof the scale for this indicator. As for the slope fo the regression, we can observe that global regression is slightly steeper with higher y-intercept (baseline correlation) than Malaysia. Local w_awr dependency with this indicator ranges from low medium to high.
i. Weighted Aggregated Water Risk (w_awr) vs Untreated Connected Wastewater (ucw)
UCW is observed to have positive linear relationship with w_awr for global dataset but regression cannot be made for Malaysian reservoirs. This is because all available measurement/scores pertaining to Malaysia reservoirs are on the absolute maximum of the spectrum, with all datapoints scoring 5 (extremely high) in UCW. This means that untreated discharge of wastewater is extremely prevalent in Malaysia. This should be a major concern for policymakers, well at least with this interpretation. Local w_awr dependency with this indicator ranges from low to high.
j. Weighted Aggregated Water Risk (w_awr) vs Coastal Eutrophication Potential (cep)
CEP is observed to have positive linear relationship with w_awr for global dataset and somewhat constant for Malaysian dataset. Local w_awr dependency with this indicator ranges from medium to high. It is also observed that most Malaysian reservoirs score in the lower end of the indicator score, indicating that most local reservoirs have lower eutrophication potential.
k. Weighted Aggregated Water Risk (w_awr) vs Unimproved/No Drinking Water (udw)
UDW is observed to have positive linear relationship with w_awr for global dataset and Malaysian dataset. The steepness of the local linear regression is lower than global dataset linear regression, indication of lower w_awr dependency on this indicator. The distribution of UDW scored among Malaysian reservoirs is very distributed, ranging from very low to high.
l. Weighted Aggregated Water Risk (w_awr) vs Unimproved/No Sanitation (usa)
USA is observed to have positive linear relationship with w_awr for global dataset and Malaysian dataset. Even more surprisingly, it is observed that linear regression for both global and Malaysian dataset follow each other very closely, as if they are almost collinear with each other. Most Malaysian reservoirs score very low on this indicators with exception of few outliers on the very high end of the range. It can be said that local u_awr score dependency on this indicator ranges from low-medium to medium-high.
m. Weighted Aggregated Water Risk (w_awr) vs Peak RepRisk Country ESG Risk Index (rri)
RRI is observed to have positive linear relationship with w_awr for global dataset. However, regression cannot be made for Malaysian dataset because it appears that the steepness would be 90 degrees vertical if regression is to be done. This indicates that all Malaysian datapoints have the same or near same score for RRI scores, translating to infinitesimal standard deviation. The dependency of local w_awr ranges between low-medium to medium-high with a (?) datapoint being anomalous, scoring near 5 points for w_awr score.
The Author had limited the boxplot to only local dataset so that one can evaluate how statistically significant this dataset is for Malaysian case study. Some indicators have very narrow median, while other may have large median range. Notable observations of extremely narrow median range would be for RRI, USA, UCW and BWS scores, while CEP score seems to contain a lot of outliers above 75% percentile.
The Author had included the results for following statistical parameters from dataset evaluation using describe():
a. count
b. mean or average
c. standard deviation (square the value for variance)
d. minimum datapoint
e. maximum datapoint
f. 1/4 percentile threshold
g. 3/4 percentile threshold
h. maximum datapoint
Evaluation had been made for both reduced global and local dataset.
Using scatter_matrix function, the Author had been able to visualise interdependencies of the score types with each other. The diagonal component represents the distribution curve. Unfortunately, linear regression was unable to be embedded within the matrix.
The Author had used heatmap to visualise how intense each indicator scores affect other indicator scores. This is made possible using Seaborn package.